Netdev List
 help / color / mirror / Atom feed
* Re: [patch 09/24] timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id()
From: Thomas Weißschuh @ 2026-06-26 11:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound
In-Reply-To: <87echtk24a.ffs@fw13>

On Fri, Jun 26, 2026 at 12:49:41PM +0200, Thomas Gleixner wrote:
> On Fri, Jun 26 2026 at 10:48, Thomas Weißschuh wrote:
> > On Tue, May 26, 2026 at 07:14:13PM +0200, Thomas Gleixner wrote:
> > (...)
> >
> >>  static inline void tk_update_aux_offs(struct timekeeper *tk, ktime_t offs)
> >> @@ -1218,6 +1223,12 @@ bool ktime_get_snapshot_id(struct system
> >>  		tkd = &tk_core;
> >>  		offs = &tk_core.timekeeper.offs_boot;
> >>  		break;
> >> +	case CLOCK_AUX ... CLOCK_AUX_LAST:
> >> +		tkd = aux_get_tk_data(clock_id);
> >> +		if (!tkd)
> >> +			return false;
> >> +		offs = &tkd->timekeeper.offs_aux;
> >> +		break;
> >
> > 'tkd' is also used to compute 'monoraw'. However 'tkr_raw' and 'tkr_mono'
> > are the same for auxilary clocks, so this will compute a wrong 'monoraw'.
> 
> AUX clocks are independent in the first place and the MONORAW part is
> the "MONORAW" related to the AUX clock itself. 
> 
> > Instead 'monoraw' should be computed based on 'tk_core'.
> > Which then also requires the sequence locking of 'tk_core'.
> 
> No. From a PTP and steering point of view you want the "raw" value which
> is related to the AUX clock itself and not the global one.

Ack.

However the kdocs call it 'CLOCK_MONOTONIC_RAW'. Can we clean this up?

> The global MONORAW and the AUX clock MONORAW are related obviously as
> they share the same conversion factors.
> 
>      MONORAW(AUX$N) = MONORAW(GLOBAL) + OFFSET(AUX$N)
> 
> We don't have that information anywhere right now and we might want to
> expose it to allow user space a proper correlation, but that's an
> orthogonal problem.
> 
> From a PTP/steering point of view the code is correct as is.

Thanks for the clarifications.


Thomas

^ permalink raw reply

* Re: [PATCH net v7 3/4] iavf: send MAC change request synchronously
From: Jose Ignacio Tornos Martinez @ 2026-06-26 10:54 UTC (permalink / raw)
  To: horms
  Cc: aleksandr.loktionov, anthony.l.nguyen, davem, edumazet,
	intel-wired-lan, jacob.e.keller, jtornosm, kuba, netdev, pabeni,
	przemyslaw.kitszel
In-Reply-To: <20260624163148.1146093-2-horms@kernel.org>

>>       if (!ether_addr_equal(netdev->dev_addr, addr->sa_data))
>>               return -EACCES;

> [Severity: High]
> Will this cause a persistent state mismatch if a timeout is followed by a
> retry?
>
> If iavf_set_mac_sync() times out, it returns -EAGAIN and bypasses this
> rollback, leaving adapter->hw.mac.addr updated but netdev->dev_addr at the
> old MAC.
>
> If the user then retries before the PF asynchronously completes the original
> request, iavf_add_ether_addrs() returns -EBUSY. Since -EBUSY != -EAGAIN,
> the rollback runs, forcibly reverting adapter->hw.mac.addr to the old MAC.
> 
> When the PF's delayed success response finally arrives,
> iavf_virtchnl_completion() sees the old MAC in both hw.mac.addr and
> netdev->dev_addr, bypassing the netdev MAC update. The PF hardware is now
> programmed with the new MAC, but the kernel retains the old MAC, dropping
> traffic.

This scenario requires an extremely narrow timing window: a 2.5s PF timeout
(rare in normal operation, PF typically responds much faster), followed by an
immediate retry before adminq_task clears current_op (microsecond window).

Even if this race occurs, it's recoverable - the retry fails with -EBUSY
and the user retries again successfully. This is a transient failure, not
permanent state corruption.

>>       if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
>>               /* bail because we already have a command pending */
>>               dev_err(&adapter->pdev->dev, "Cannot add filters, command %d pending\n",
>>                      adapter->current_op);
>> -             return;
>> +             return -EBUSY;
>>       }
>
> [Severity: High]
> Could routine background tasks cause MAC changes to frequently fail?
>
> The watchdog task routinely sets adapter->current_op for background commands
> like VIRTCHNL_OP_GET_STATS. If iavf_set_mac() is called during this window,
> iavf_set_mac_sync() will hit this check and immediately fail the MAC change
> with -EBUSY.
>
> Because there is no retry logic in iavf_set_mac(), this error propagates
> directly to userspace, making MAC address configuration randomly fail based
> on watchdog timing.

Background virtchnl operations (like GET_STATS) complete quickly, typically
within milliseconds. The collision window is very small.

Fail-fast with -EBUSY is semantically correct and allows user-space retry.
If a collision occurs, it's spurious and recoverable immediately on retry.
This is acceptable compared to the complexity and deadlock risks of queuing
or waiting for background operations to complete while holding netdev_lock.

>> -static void iavf_mac_add_reject(struct iavf_adapter *adapter)
>> + */
>> +void iavf_mac_add_reject(struct iavf_adapter *adapter)
>>  {
>>       struct net_device *netdev = adapter->netdev;
>>       struct iavf_mac_filter *f, *ftmp;
>
> [Severity: Medium]
> Does exporting and reusing iavf_mac_add_reject() for local rollback cause
> unintended background MAC changes?
>
> When the user requests a MAC address that is already in the filter list
> (e.g., a secondary MAC), iavf_add_filter() sets f->is_new_mac = false and
> f->add = true, and flags IAVF_FLAG_AQ_ADD_MAC_FILTER.
>
> If the synchronous operation fails locally (like the -EBUSY return above),
> iavf_set_mac() calls this function. However, iavf_mac_add_reject() only
> clears the pending state and removes the filter if f->is_new_mac is true:
>
>    if (f->is_new_mac) {
>        list_del(&f->list);
>        kfree(f);
>    }
>
> Reused filters are ignored, leaving f->add = true and the
> IAVF_FLAG_AQ_ADD_MAC_FILTER flag active. The background watchdog task will
> eventually process this flag and blindly send the MAC configuration to the
> PF, even though the VF already aborted the operation locally.

This scenario only occurs when setting the primary MAC to an address that
already exists as a secondary MAC in the filter list - an extremely rare
configuration.

Even if this occurs and the watchdog later sends the MAC to the PF, it is
harmless: the MAC is already configured on the PF (as a secondary), so the
redundant ADD_ETH_ADDR message has no adverse effect.

The common case - changing primary MAC to a new address - uses is_new_mac =
true and is handled correctly by the rollback logic.



These concerns represent theoretical edge cases that are extremely unlikely
in practice. The synchronous approach fixes a reproducible deadlock affecting
all users in production, allowing the user to retry and complete the
operation. The trade-off is justified.


^ permalink raw reply

* Re: [PATCH 4/6] dt-bindings: net: bluetooth: Document Qualcomm IPQ5018 Bluetooth controller
From: Krzysztof Kozlowski @ 2026-06-26 10:53 UTC (permalink / raw)
  To: George Moussalem
  Cc: Jens Axboe, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Johannes Berg, Jeff Johnson, Bartosz Golaszewski,
	Marcel Holtmann, Luiz Augusto von Dentz, Balakrishna Godavarthi,
	Rocky Liao, Saravana Kannan, Andrew Lunn, Heiner Kallweit,
	Russell King, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Bjorn Andersson, Konrad Dybcio,
	Mathieu Poirier, Philipp Zabel, linux-block, linux-kernel,
	linux-mmc, devicetree, linux-wireless, ath10k, linux-arm-msm,
	linux-bluetooth, netdev, linux-remoteproc
In-Reply-To: <20260625-ipq5018-bluetooth-v1-4-d999be0e04f7@outlook.com>

On Thu, Jun 25, 2026 at 06:10:08PM +0400, George Moussalem wrote:
> Document the Qualcomm IPQ5018 Bluetooth controller.
> 
> Signed-off-by: George Moussalem <george.moussalem@outlook.com>
> ---
>  .../bindings/net/bluetooth/qcom,ipq5018-bt.yaml    | 63 ++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/bluetooth/qcom,ipq5018-bt.yaml b/Documentation/devicetree/bindings/net/bluetooth/qcom,ipq5018-bt.yaml
> new file mode 100644
> index 000000000000..afd33f851858
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/bluetooth/qcom,ipq5018-bt.yaml
> @@ -0,0 +1,63 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/net/bluetooth/qcom,ipq5018-bt.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Qualcomm IPQ5018 Bluetooth
> +
> +maintainers:
> +  - George Moussalem <george.moussalem@outlook.com>
> +
> +properties:
> +  compatible:
> +    enum:
> +      - qcom,ipq5018-bt
> +
> +  interrupts:
> +    items:
> +      - description:
> +          Interrupt line from the M0 Bluetooth Subsystem to the host processor

What is M0?

Anyway, this part feels completely redundant. Can "interrupts" property
be anything else than an interrupt line from the device to the host
processor?


> +          to notify it of events such as re

This feels useful, but cut/incomplete.

> +
> +  qcom,ipc:
> +    $ref: /schemas/types.yaml#/definitions/phandle-array
> +    items:
> +      - items:
> +          - description: phandle to a syscon node representing the APCS registers
> +          - description: u32 representing offset to the register within the syscon
> +          - description: u32 representing the ipc bit within the register
> +    description: |
> +      These entries specify the outgoing IPC bit used for signaling the remote
> +      M0 BTSS core of a host event or for sending an ACK if the remote processor
> +      expects it.
> +
> +  qcom,rproc:
> +    $ref: /schemas/types.yaml#/definitions/phandle
> +    description:
> +      Phandle to the remote processor node representing the M0 BTSS core.
> +
> +required:
> +  - compatible
> +  - interrupts
> +  - qcom,ipc
> +  - qcom,rproc
> +
> +allOf:
> +  - $ref: bluetooth-controller.yaml#
> +  - $ref: qcom,bluetooth-common.yaml
> +
> +unevaluatedProperties: false
> +
> +examples:
> +  - |
> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> +
> +    bluetooth: bluetooth {

Drop unused label

> +      compatible = "qcom,ipq5018-bt";
> +
> +      qcom,ipc = <&apcs_glb 8 23>;
> +      interrupts = <GIC_SPI 162 IRQ_TYPE_EDGE_RISING>;

No firmware to load?

It feels like remoteproc node split is fake. The property qcom,rproc is
even more supporting that case. Shouldn't this be simply one device -
bluetooth? What sort of two devices do you have exactly? How can I
identify them in the hardware?

> +
> +      qcom,rproc = <&m0_btss>;

Best regards,
Krzysztof


^ permalink raw reply

* Re: [PATCH 1/6] dt-bindings: remoteproc: document M0 Bluetooth Subsystem secure PIL
From: George Moussalem @ 2026-06-26 10:51 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Jens Axboe, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Johannes Berg, Jeff Johnson, Bartosz Golaszewski,
	Marcel Holtmann, Luiz Augusto von Dentz, Balakrishna Godavarthi,
	Rocky Liao, Saravana Kannan, Andrew Lunn, Heiner Kallweit,
	Russell King, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Bjorn Andersson, Konrad Dybcio,
	Mathieu Poirier, Philipp Zabel, linux-block, linux-kernel,
	linux-mmc, devicetree, linux-wireless, ath10k, linux-arm-msm,
	linux-bluetooth, netdev, linux-remoteproc
In-Reply-To: <20260626-tiny-warm-jerboa-3ba57a@quoll>

Hi Krzysztof,

On 6/26/26 14:47, Krzysztof Kozlowski wrote:
> On Thu, Jun 25, 2026 at 06:10:05PM +0400, George Moussalem wrote:
>> Document the M0 Bluetooth Subsystem remote processor core found in the
>> Qualcomm IPQ5018 SoC. Firmware loaded is authenticated via TrustZone.
>> The firmware running on the M0 core provides bluetooth functionality.
>>
>> Signed-off-by: George Moussalem <george.moussalem@outlook.com>
>> ---
>>  .../bindings/remoteproc/qcom,m0-btss-pil.yaml      | 72 ++++++++++++++++++++++
>>  1 file changed, 72 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml b/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml
>> new file mode 100644
>> index 000000000000..397bb6815d71
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml
> 
> Use compatible as filename.

understood, will update in v2.

> 
>> @@ -0,0 +1,72 @@
>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/remoteproc/qcom,m0-btss-pil.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: Qualcomm M0 BTSS Peripheral Image Loader
>> +
>> +maintainers:
>> +  - George Moussalem <george.moussalem@outlook.com>
>> +
>> +description:
>> +  Qualcomm M0 BTSS Peripheral Secure Image Loader loads firmware and powers up
>> +  the M0 BTSS remote processor core on the Qualcomm IPQ5018 SoC.
>> +
>> +properties:
>> +  compatible:
>> +    enum:
>> +      - qcom,ipq5018-btss-pil
>> +
>> +  firmware-name:
>> +    maxItems: 1
>> +    description: Firmware name for the M0 Bluetooth Subsystem core
> 
> You can drop description, pretty obvious.

will drop

> 
>> +
>> +  clocks:
>> +    items:
>> +      - description: M0 BTSS low power oscillator clock
>> +
>> +  clock-names:
>> +    items:
>> +      - const: btss_lpo_clk
> 
> Just "lpo"

will update

> 
>> +
>> +  memory-region:
>> +    items:
>> +      - description: M0 BTSS reserved memory carveout
>> +
>> +  resets:
>> +    items:
>> +      - description: M0 BTSS reset
>> +
>> +  reset-names:
>> +    items:
>> +      - const: btss_reset
> 
> Drop names. Using block name as input name is not really useful.

Will drop

> 
> No supplies? no address space? How do you actually trigger remoteproc
> startup?

No supplied and no address space. The core is booted by a
qcom_scm_auth_and_reset call to TrustZone which authenticated the
firmware, takes it out of reset and boots it.

> 
>> +
>> +required:
>> +  - compatible
>> +  - firmware-name
>> +  - clocks
>> +  - clock-names
>> +  - resets
>> +  - reset-names
>> +  - memory-region
>> +
>> +additionalProperties: false
> 
> Best regards,
> Krzysztof
> 


^ permalink raw reply

* Re: [patch 09/24] timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id()
From: David Woodhouse @ 2026-06-26 10:51 UTC (permalink / raw)
  To: Thomas Weißschuh, Thomas Gleixner
  Cc: LKML, Miroslav Lichvar, John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound
In-Reply-To: <20260626103359-66ab2b54-d36f-416b-94a4-3f3708dccced@linutronix.de>

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

On Fri, 2026-06-26 at 10:48 +0200, Thomas Weißschuh wrote:
> 'tkd' is also used to compute 'monoraw'. However 'tkr_raw' and 'tkr_mono'
> are the same for auxilary clocks, so this will compute a wrong 'monoraw'.
> Instead 'monoraw' should be computed based on 'tk_core'.
> Which then also requires the sequence locking of 'tk_core'.
> 
> As you know I have a series which unifies the locking between the
> different timekeepers. Maybe we revert this patch for 7.2 and I send
> a fixed variant including the prerequisites for 7.3.
> 
> (The same goes for get_device_system_crosststamp())

No fundamental objection from me... but does it matter in practice?
Is it even reachable?

I think the only way these functions can end up being invoked for aux
clocks is via PTP_SYS_OFFSET_EXTENDED, which since commit a6d799608e6
("ptp: Switch to ktime_get_snapshot_id() for pre/post timestamps") and
some other driver-specific changes in this series, can now invoke them
with the user-requested clockid.

But those callers *only* care about ::systime, don't they? So the
problem never arises because there's no code path in which anything
actually cares about ::monoraw for AUX clocks.

If we back out the handling of AUX clocks in ktime_get_snapshot_id()
we'd have to roll back much of that other plumbing too. If we *just*
remove the 'case CLOCK_AUX ... CLOCK_AUX_LAST' hunk, we'd end up with
users being able to trigger the 'default: WARN_ON_ONCE(1)' which
follows.

So while you're right, I *think* it's harmless in practice and
reverting it safely is slightly more complex than it seems at first
glance?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [patch 09/24] timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id()
From: Thomas Gleixner @ 2026-06-26 10:49 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound
In-Reply-To: <20260626103359-66ab2b54-d36f-416b-94a4-3f3708dccced@linutronix.de>

On Fri, Jun 26 2026 at 10:48, Thomas Weißschuh wrote:
> On Tue, May 26, 2026 at 07:14:13PM +0200, Thomas Gleixner wrote:
> (...)
>
>>  static inline void tk_update_aux_offs(struct timekeeper *tk, ktime_t offs)
>> @@ -1218,6 +1223,12 @@ bool ktime_get_snapshot_id(struct system
>>  		tkd = &tk_core;
>>  		offs = &tk_core.timekeeper.offs_boot;
>>  		break;
>> +	case CLOCK_AUX ... CLOCK_AUX_LAST:
>> +		tkd = aux_get_tk_data(clock_id);
>> +		if (!tkd)
>> +			return false;
>> +		offs = &tkd->timekeeper.offs_aux;
>> +		break;
>
> 'tkd' is also used to compute 'monoraw'. However 'tkr_raw' and 'tkr_mono'
> are the same for auxilary clocks, so this will compute a wrong 'monoraw'.

AUX clocks are independent in the first place and the MONORAW part is
the "MONORAW" related to the AUX clock itself. 

> Instead 'monoraw' should be computed based on 'tk_core'.
> Which then also requires the sequence locking of 'tk_core'.

No. From a PTP and steering point of view you want the "raw" value which
is related to the AUX clock itself and not the global one.

The global MONORAW and the AUX clock MONORAW are related obviously as
they share the same conversion factors.

     MONORAW(AUX$N) = MONORAW(GLOBAL) + OFFSET(AUX$N)

We don't have that information anywhere right now and we might want to
expose it to allow user space a proper correlation, but that's an
orthogonal problem.

From a PTP/steering point of view the code is correct as is.

Thanks,

        tglx



^ permalink raw reply

* Re: [PATCH 1/6] dt-bindings: remoteproc: document M0 Bluetooth Subsystem secure PIL
From: Krzysztof Kozlowski @ 2026-06-26 10:47 UTC (permalink / raw)
  To: George Moussalem
  Cc: Jens Axboe, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Johannes Berg, Jeff Johnson, Bartosz Golaszewski,
	Marcel Holtmann, Luiz Augusto von Dentz, Balakrishna Godavarthi,
	Rocky Liao, Saravana Kannan, Andrew Lunn, Heiner Kallweit,
	Russell King, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Bjorn Andersson, Konrad Dybcio,
	Mathieu Poirier, Philipp Zabel, linux-block, linux-kernel,
	linux-mmc, devicetree, linux-wireless, ath10k, linux-arm-msm,
	linux-bluetooth, netdev, linux-remoteproc
In-Reply-To: <20260625-ipq5018-bluetooth-v1-1-d999be0e04f7@outlook.com>

On Thu, Jun 25, 2026 at 06:10:05PM +0400, George Moussalem wrote:
> Document the M0 Bluetooth Subsystem remote processor core found in the
> Qualcomm IPQ5018 SoC. Firmware loaded is authenticated via TrustZone.
> The firmware running on the M0 core provides bluetooth functionality.
> 
> Signed-off-by: George Moussalem <george.moussalem@outlook.com>
> ---
>  .../bindings/remoteproc/qcom,m0-btss-pil.yaml      | 72 ++++++++++++++++++++++
>  1 file changed, 72 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml b/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml
> new file mode 100644
> index 000000000000..397bb6815d71
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/remoteproc/qcom,m0-btss-pil.yaml

Use compatible as filename.

> @@ -0,0 +1,72 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/remoteproc/qcom,m0-btss-pil.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Qualcomm M0 BTSS Peripheral Image Loader
> +
> +maintainers:
> +  - George Moussalem <george.moussalem@outlook.com>
> +
> +description:
> +  Qualcomm M0 BTSS Peripheral Secure Image Loader loads firmware and powers up
> +  the M0 BTSS remote processor core on the Qualcomm IPQ5018 SoC.
> +
> +properties:
> +  compatible:
> +    enum:
> +      - qcom,ipq5018-btss-pil
> +
> +  firmware-name:
> +    maxItems: 1
> +    description: Firmware name for the M0 Bluetooth Subsystem core

You can drop description, pretty obvious.

> +
> +  clocks:
> +    items:
> +      - description: M0 BTSS low power oscillator clock
> +
> +  clock-names:
> +    items:
> +      - const: btss_lpo_clk

Just "lpo"

> +
> +  memory-region:
> +    items:
> +      - description: M0 BTSS reserved memory carveout
> +
> +  resets:
> +    items:
> +      - description: M0 BTSS reset
> +
> +  reset-names:
> +    items:
> +      - const: btss_reset

Drop names. Using block name as input name is not really useful.

No supplies? no address space? How do you actually trigger remoteproc
startup?

> +
> +required:
> +  - compatible
> +  - firmware-name
> +  - clocks
> +  - clock-names
> +  - resets
> +  - reset-names
> +  - memory-region
> +
> +additionalProperties: false

Best regards,
Krzysztof


^ permalink raw reply

* Re: [PATCH net v2 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-26 10:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, jiri, victor, security,
	zdi-disclosures, stable, kernel test robot
In-Reply-To: <20260624224016.24018-1-jhs@mojatatu.com>

"

On Wed, Jun 24, 2026 at 6:40 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> The teql master->slaves singly linked list is not protected against
> multiple writes. It can be mod'ed concurently from teql_master_xmit(),
> teql_dequeue(), teql_init() and teql_destroy() without holding any list
> lock or RCU protection.
>
> zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
> after an RCU grace period, but teql_master_xmit() running on another
> CPU can still hold a stale pointer into the list, resulting in a
> slab-use-after-free:
>
> BUG: KASAN: slab-use-after-free in teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
> Read of size 8 at addr ffff88802923aa80 by task ip/10024
>
> The zdi-disclosures@trendmicro.com repro created concurrent AF_PACKET
> senders on a teql device against a thread that repeatedly adds/deletes the
> slave qdisc, together with a SLUB spray that reclaims the freed slot; the
> resulting UAF is controllable enough to be turned into a read/write
> primitive against the freed qdisc object.
>
> The fix?
> Add a per-master slaves_lock spinlock that serializes all mutations of
> master->slaves and the NEXT_SLAVE() links in teql_destroy() and
> teql_qdisc_init(). teql_master_xmit() also takes the same slaves_lock
> around those updates.
> Annotate master->slaves and the per-slave ->next pointer with __rcu and
> use the appropriate RCU accessors everywhere they are touched:
> rcu_assign_pointer() on the writer side (under slaves_lock),
> rcu_dereference_protected() for the writer-side loads (also under
> slaves_lock), rcu_dereference_bh() for the loads in teql_master_xmit() and
> rtnl_dereference() for the loads in teql_master_open()/teql_master_mtu(),
> which run under RTNL.
> Pair this with rcu_read_lock_bh()/rcu_read_unlock_bh() around the list
> traversal in teql_master_xmit(), so that readers either observe a fully
> linked list or are deferred until the in-flight mutation completes. The two
> early-return paths in teql_master_xmit() are updated to release the RCU-bh
> read-side critical section before returning, since leaving it held would
> disable BH on that CPU for good.
>

sashiko-gemini's complaints:
https://sashiko.dev/#/patchset/20260624224016.24018-1-jhs%40mojatatu.com
seem bogus to me (someone correct me if i am wrong). I am only going
to address the first claim of "TOCTOU / "resurrection" race in
teql_master_xmit()"
teql_master_xmit() holds rcu_read_lock_bh() across the entire
traversal. teql_destroy() freeing can only proceed once the qdisc's
RCU grace period has elapsed - so where is this TOCTOU? Let's say this
were true: both calls hold the slaves_lock.
The other issues are of similar nature.

OTOH, sashiko-claude
(https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260624224016.24018-1-jhs%40mojatatu.com)
does make some valid claims which are low value, so not sure a resend
is worth it.
For example in claim 1 it says "Should the changelog mention this
teql_dequeue() site too?" Sure I can - but just because I provided
extra information in the commit log, which I could have omitted, now I
have to add more info? ;-> The second claim is "rcu_dereference_bh()
should be rcu_dereference_protected() on writer side". Sparse didnt
complain and i dont see this as breakage rather a consistency measure.

Unless I am missing something ..

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next 6/9] net: sparx5: vcap: add lpm vcap implementation
From: Jens Emil Schulz Ostergaard @ 2026-06-26 10:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: horatiu.vultur, UNGLinuxDriver, andrew+netdev, davem, edumazet,
	pabeni, daniel.machon, Steen.Hegelund, kees, gustavoars, netdev,
	linux-kernel, linux-arm-kernel, linux-hardening
In-Reply-To: <20260615233849.1042290-1-kuba@kernel.org>

On Mon, 2026-06-15 at 16:38 -0700, Jakub Kicinski wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: sparx5: vcap: add lpm vcap implementation
> 
> Add the implementation for the LPM VCAP for lan969x and sparx5. The LPM
> VCAP has a static keyset configuration, and a single lookup dedicated to
> LPM, so this is straight forward, and many functions can use existing
> functionality.
> 
> > diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c b/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> > index 95b93e46a41dc..e25e759c24f60 100644
> > --- a/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> > +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> > @@ -1083,6 +1146,46 @@ static void sparx5_vcap_es2_cache_write(struct sparx5 *sparx5,
> >       }
> >  }
> > 
> > +static void sparx5_vcap_lpm_cache_write(struct sparx5 *sparx5,
> > +                                     struct vcap_admin *admin,
> > +                                     enum vcap_selection sel,
> > +                                     u32 start,
> > +                                     u32 count)
> > +{
> > +     u32 *keystr, *mskstr, *actstr;
> > +     int idx;
> > +
> > +     keystr = &admin->cache.keystream[start];
> > +     mskstr = &admin->cache.maskstream[start];
> > +     actstr = &admin->cache.actionstream[start];
> > +
> > +     switch (sel) {
> > +     case VCAP_SEL_ENTRY:
> > +             for (idx = 0; idx < count; ++idx) {
> > +                     /* Avoid 'match-off' by setting value & mask */
> > +                     spx5_wr(keystr[idx] & mskstr[idx], sparx5,
> > +                             VCAP_SUPER_VCAP_ENTRY_DAT(idx));
> 
> [Severity: High]
> Are concurrent writes to the shared VCAP_SUPER_VCAP_ENTRY_DAT registers
> safe here without a global hardware lock across the different VCAP types?
> 
> It looks like IS0 and IS2 rules are updated via vcap_add_rule() under the
> rtnl_lock(). However, the LPM rules are updated under the
> sparx5->router->lock inside a workqueue.
> 
> Since both execution paths share the Super VCAP hardware block, could
> concurrent updates from TC (IS0/IS2) and routing (LPM) race on these shared
> cache registers, and also on VCAP_SUPER_CTRL in sparx5_vcap_super_update(),
> garbling the hardware command and corrupting the cache memory?
> 

Very nice catch, I missed this. This is even reachable before this series
because the debugfs dump also uses these registers, and runs outside of
rtnl.

I will send a fix to net which replaces the per instance locks with a single
vcap_control global mutex.

> > +                     spx5_wr(~mskstr[idx], sparx5,
> > +                             VCAP_SUPER_VCAP_MASK_DAT(idx));
> > +             }
> > +             break;
> 
> [ ... ]
> 
> > @@ -1379,6 +1523,7 @@ static void sparx5_vcap_update(struct net_device *ndev,
> >       struct sparx5 *sparx5 = port->sparx5;
> > 
> >       switch (admin->vtype) {
> > +     case VCAP_TYPE_LPM:
> >       case VCAP_TYPE_IS0:
> >       case VCAP_TYPE_IS2:
> >               sparx5_vcap_super_update(sparx5, cmd, sel, addr);


^ permalink raw reply

* Re: [PATCH net-next 1/9] net: microchip: vcap: fix rule move for rules of coprime size
From: Jens Emil Schulz Ostergaard @ 2026-06-26 10:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: horatiu.vultur, UNGLinuxDriver, andrew+netdev, davem, edumazet,
	pabeni, daniel.machon, Steen.Hegelund, kees, gustavoars, netdev,
	linux-kernel, linux-arm-kernel, linux-hardening
In-Reply-To: <20260615233845.1042219-1-kuba@kernel.org>

On Mon, 2026-06-15 at 16:38 -0700, Jakub Kicinski wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> Subject: net: microchip: vcap: fix rule move for rules of coprime size
> 
> This patch addresses a discrepancy in how software and hardware move rule
> addresses on insert and delete for the LPM VCAP. It introduces a uniform
> offset computation aligned to the least common multiple of all rule sizes
> in a block. This ensures that hardware and software representations remain
> synchronized and all rules land on correctly aligned addresses.
> 
> > diff --git a/drivers/net/ethernet/microchip/vcap/vcap_api.c b/drivers/net/ethernet/microchip/vcap/vcap_api.c
> > index 0fdb5e363badc..6946fd7384581 100644
> > --- a/drivers/net/ethernet/microchip/vcap/vcap_api.c
> > +++ b/drivers/net/ethernet/microchip/vcap/vcap_api.c
> 
> [ ... ]
> 
> > @@ -2275,8 +2336,7 @@ int vcap_add_rule(struct vcap_rule *rule)
> >                      __func__, __LINE__, ret);
> >               goto out;
> >       }
> > -     if (move.count > 0)
> > -             vcap_move_rules(ri, &move);
> > +     vcap_move_rules(ri, &move);
> > 
> >       /* Set the counter to zero */
> >       ret = vcap_write_counter(ri, &ctr);
> 
> [Severity: High]
> This is a pre-existing issue, but does vcap_add_rule() leak the newly
> duplicated rule if operations like vcap_write_counter() or
> vcap_write_rule() fail later on?
> 
> If a failure occurs, the code takes the goto out path:
> 
> vcap_add_rule() {
>     ...
>     ret = vcap_write_rule(ri);
>     if (ret) {
>         pr_err("%s:%d: rule write error: %d\n", __func__, __LINE__, ret);
>         goto out;
>     }
> out:
>     mutex_unlock(&ri->admin->lock);
>     return ret;
> }
> 
> This seems to exit without reverting the earlier list insertions and
> hardware moves.
> 
> Could this leave the software and hardware states permanently inconsistent
> and leak the newly duplicated rule?
> 

I think this is latent, and not actually reachable by current callers.

1. vcap_write_counter always returns 0
2. vcap_encode_rule can not error if vcap_val_rule is called before vcap_add_rule
3. vcap_write_rule can not error if vcap_val_rule is called before vcap_add_rule

So I propose this is fixed in seperate series.

> [ ... ]
> 
> > @@ -2488,59 +2548,52 @@ int vcap_mod_rule(struct vcap_rule *rule)
> >  }
> >  EXPORT_SYMBOL_GPL(vcap_mod_rule);
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but does vcap_mod_rule() discard
> the active software rule pointer and overwrite hardware with stale addresses?
> 
> Looking at vcap_mod_rule(), the function validates the rule using
> vcap_get_locked_rule() but ignores the returned pointer:
> 
> vcap_mod_rule(struct vcap_rule *rule) {
>     struct vcap_rule_internal *ri = to_intrule(rule);
>     ...
>     if (!vcap_get_locked_rule(ri->vctrl, ri->data.id))
>         return -ENOENT;
>     ...
>     err = vcap_encode_rule(ri);
>     if (err)
>         goto out;
> 
>     err = vcap_write_rule(ri);
> }
> 
> Because the function continues to use the caller provided copy ri, it
> doesn't see updates to the real rule in the admin->rules list.
> 
> If the real rule's hardware address was previously shifted due to insertions
> or deletions, could writing to the stale address in ri->addr corrupt other
> rules in the hardware VCAP?
> 

This is also true but latent again. To trigger this we would need concurrency
against the same vcap instance. Before this series vcap_mod_rule was only
called under rtnl. This series introduces a call outside rtnl, but the new LPM VCAP
is not usable by tc only the routing module, which serializes everything under
one mutex.

So I think this should be fixed in a seperate series.

> [ ... ]
> 
> One other pre-existing issue noticed while reviewing this subsystem:
> 
> [Severity: High]
> This is a pre-existing issue, but is it safe to iterate over the admin->rules
> list across different VCAP instances without acquiring admin->lock for
> each instance?
> 
> In vcap_rule_exists():
> 
> static bool vcap_rule_exists(struct vcap_control *vctrl, u32 id) {
>     ...
>     list_for_each_entry(admin, &vctrl->list, list)
>         list_for_each_entry(ri, &admin->rules, list)
>             if (ri->data.id == id)
>                 return true;
>     return false;
> }
> 
> If another thread is concurrently inserting or deleting a rule on another
> VCAP instance while holding that instance's lock, could this unprotected
> concurrent access encounter a data race and dereference a poisoned pointer?

Yes this becomes reachable with this series, due to calls outside rtnl.
It is a symptom of a wider issue with the per instance locking in the VCAP
api. Sashiko found another existing bug with the shared SUPER vcap registers
also caused by this, and that one is reachable in mainline, so I will send a
fix to net for the vcap locking which will also fix this problem, then send
v2 once that is settled.

> --
> pw-bot: cr


^ permalink raw reply

* [ANNOUNCEMENT] LPC 2026: System Monitoring and Observability Microconference
From: Breno Leitao @ 2026-06-26  9:56 UTC (permalink / raw)
  To: linux-acpi, linux-hwmon, netdev, linux-kernel, linux-arm-kernel,
	kernel-team, linux-mm
  Cc: Breno Leitao, kerneljasonxing, iipeace5, gavinguo, linux,
	amscanne, sj, gpiccoli, Daniel Gomez, mfo, platform-driver-x86,
	acpica-devel

We are pleased to announce the Call for Proposals (CFP) for another
edition of  System Monitoring and Observability Microconference, this
time at the 2026 Linux Plumbers Conference (LPC), taking place in
Prague, Czechia, from Oct 5-7, 2026.

  https://lpc.events/event/20/sessions/262/

This microconference provides a valuable forum for key engineering areas
such as:

   - Kernel Health and Runtime Monitoring
   - Hardware Integration and Error Detection
   - Correlation of Issues (crashes, stalls, bugs)
   - Virtualization Stack Monitoring
   - Memory Management Monitoring and Observability
   - Anomaly Detection Algorithms for System Behavior
   - Automated Analysis, Remediation and post mortem analyzes

The purpose of each talk is to share challenges and discuss potential
improvements. Sessions will last 20 to 30 minutes and aim to encourage
brainstorming and open dialogue about ongoing issues rather than
delivering immediate solutions.

The conference acts as both a knowledge-sharing platform and a strategic
venue for guiding the future of kernel technologies to better meet the
demands of large-scale infrastructure.

We invite you to submit your proposals here:
	https://lpc.events/event/20/abstracts/

Please select track "Linux System Monitoring and Observability MC"

^ permalink raw reply

* Re: [PATCH net v2 1/1] net: sched: ets: avoid deficit wrap and bound empty dequeue rounds
From: Jamal Hadi Salim @ 2026-06-26  9:54 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, jiri, davem, petrm, yuantan098, yifanwucs, tomapufckgml,
	zcliangcn, bird, bronzed_45_vested
In-Reply-To: <0e17a0309061300d31036a6a4c139919192f6373.1782379460.git.bronzed_45_vested@icloud.com>

On Fri, Jun 26, 2026 at 4:32 AM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Wyatt Feng <bronzed_45_vested@icloud.com>
>
> ETS keeps each DRR-style deficit in a u32 and replenishes it with
> the configured quantum whenever the head packet is too large. Both
> the quantum and qdisc_pkt_len() are user-controlled inputs: a large
> quantum can wrap the deficit counter, while a tiny quantum combined
> with an inflated qdisc_pkt_len() can force billions of iterations in
> softirq context before any packet becomes eligible.
>
> Store the deficit in u64 so replenishment cannot wrap the counter.
> This keeps the existing dequeue logic unchanged while fixing the
> overflow condition.
>
> Bound one dequeue attempt to at most nbands * 2 ETS rotations, as
> suggested in review. This avoids the livelock without adding heavier
> logic to the fast path.
>
> Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
> Cc: stable@vger.kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Assisted-by: Codex:GPT-5.4
> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

Note, you did not Cc many maintainers. Next time, or if you have to
resend this patch make sure you Cc the stakeholders (see
scripts/get_maintainers.pl)

cheers,
jamal

> ---
> changes in v2:
>   - Instead of doing a div() in the fast path, simply bound the loop per
>     dequeue
>   - v1 Link: https://lore.kernel.org/all/20260615103759.2404228-2-n05ec@lzu.edu.cn/
>
>
>  net/sched/sch_ets.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
> index cb8cf437ce87..12a156ccb0a6 100644
> --- a/net/sched/sch_ets.c
> +++ b/net/sched/sch_ets.c
> @@ -40,7 +40,7 @@ struct ets_class {
>         struct list_head alist; /* In struct ets_sched.active. */
>         struct Qdisc *qdisc;
>         u32 quantum;
> -       u32 deficit;
> +       u64 deficit;
>         struct gnet_stats_basic_sync bstats;
>         struct gnet_stats_queue qstats;
>  };
> @@ -463,6 +463,8 @@ ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
>  static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
>  {
>         struct ets_sched *q = qdisc_priv(sch);
> +       unsigned int max_loops = READ_ONCE(q->nbands) * 2;
> +       unsigned int loops = 0;
>         struct ets_class *cl;
>         struct sk_buff *skb;
>         unsigned int band;
> @@ -499,6 +501,8 @@ static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
>
>                 cl->deficit += READ_ONCE(cl->quantum);
>                 list_move_tail(&cl->alist, &q->active);
> +               if (++loops > max_loops)
> +                       goto out;
>         }
>  out:
>         return NULL;
> --
> 2.47.3
>

^ permalink raw reply

* Re: [PATCH v3] net: fman: fix use-after-free on IRQF_SHARED handler after probe failure
From: 赵金明 @ 2026-06-26  9:53 UTC (permalink / raw)
  To: horms
  Cc: horms, Andrew Lunn, andrew+netdev, davem, edumazet, kuba,
	linux-kernel, madalin.bucur, netdev, pabeni, sean.anderson
In-Reply-To: <20260625164237.1222477-1-horms@kernel.org>

Hi,

The analysis is logically correct. Since fman is zero-initialized by
kzalloc_obj(), both fman->cfg and fman->fpm_regs are NULL when
devm_request_irq() registers the shared IRQ handler. The guard in
fman_irq():

    if (!is_init_done(fman->cfg))
        return IRQ_NONE;

does not protect against this case because is_init_done(NULL) returns
true, so the handler would proceed to dereference the NULL
fpm_regs pointer via ioread32be().

However, this is a pre-existing issue unrelated to the UAF fix in this
patch. The window is very short -- between devm_request_irq() and the
completion of fman_init() -- and would require another device on the
same shared IRQ line to fire an interrupt during that interval.

If this should be addressed, I will send a separate patch for it.
The current patch is focused solely on the post-IRQ-registration UAF
on error paths.

Please let me know if you would like me to handle this separately.

Best regards,
ZhaoJinming



>This is an AI-generated review of your patch. The human sending this



>email has considered the AI review valid, or at least plausible.



>Full review at: https://sashiko.dev



>---



>net: fman: fix use-after-free on IRQF_SHARED handler after probe failure



>



>This patch fixes a use-after-free in the error path of read_dts_node(). It



>manually releases the registered IRQ handlers before freeing the fman



>structure, preventing the devres cleanup from leaving a registered IRQ



>pointing to freed memory.



>



>> diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c



>> index 013273a2de32f..ba2338da0cea1 100644



>> --- a/drivers/net/ethernet/freescale/fman/fman.c



>> +++ b/drivers/net/ethernet/freescale/fman/fman.c



>[ ... ]



>> @@ -2786,7 +2790,7 @@ static struct fman *read_dts_node(struct platform_device *of_dev)



>>? 		if (err < 0) {



>>? 			dev_err(&of_dev->dev, "%s: irq %d allocation failed (error = %d)\n",



>>? 				__func__, fman->dts_params.err_irq, err);



>> -			goto fman_free;



>> +			goto free_main_irq;



>



>[Severity: High]



>This is a pre-existing issue, but since we are modifying the IRQ error



>handling here, could the shared interrupt handler crash if it fires during



>this setup phase?



>



>Just before this snippet in read_dts_node(), devm_request_irq() is called



>with IRQF_SHARED while fman is still partially initialized. Since fman was



>allocated with kzalloc_obj(), fman->cfg and fman->fpm_regs are NULL.



>



>If a shared interrupt fires immediately, fman_irq() is called:



>



>drivers/net/ethernet/freescale/fman/fman.c:fman_irq() {



>??????? ...



>??????? if (!is_init_done(fman->cfg))



>??????????????? return IRQ_NONE;



>



>??????? fpm_rg = fman->fpm_regs;



>??????? pending = ioread32be(&fpm_rg->fm_npi);



>??????? ...



>}



>



>Since fman->cfg is NULL, is_init_done() behaves like this:



>



>drivers/net/ethernet/freescale/fman/fman.c:is_init_done() {



>??????? if (!cfg)



>??????????????? return true;



>??????? ...



>}



>



>Would this cause fman_irq() to bypass the early return, proceed with a NULL



>fpm_rg pointer, and dereference it during the ioread32be() call?



>



>>? 		}



>>? 	}



>



^ permalink raw reply

* [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: xuanqiang.luo @ 2026-06-26  9:38 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, netdev
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, luoxuanqiang

From: luoxuanqiang <luoxuanqiang@kylinos.cn>

IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
range on a per-socket basis.  __inet_hash_connect() already treats such a
range as an explicit application partition and scans it with step 1 [1].

Do the same in inet_csk_find_open_port(): when a socket-local range is set,
walk the whole selected range instead of first splitting it by parity.
Keep the existing step-2 parity behavior for sockets using the netns range,
so the default bind/connect separation remains unchanged.

[1] https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
---
 net/ipv4/inet_connection_sock.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 56902bba54838..ad8af70c92ca3 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -323,13 +323,16 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	bool local_ports;
 	bool relax = false;
+	int step;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
-	inet_sk_get_local_port_range(sk, &low, &high);
+	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
+	step = local_ports ? 1 : 2;
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	if (high - low < 4)
 		attempt_half = 0;
@@ -342,18 +345,19 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			low = half;
 	}
 	remaining = high - low;
-	if (likely(remaining > 1))
+	if (!local_ports && remaining > 1)
 		remaining &= ~1U;
 
 	offset = get_random_u32_below(remaining);
 	/* __inet_hash_connect() favors ports having @low parity
 	 * We do the opposite to not pollute connect() users.
 	 */
-	offset |= 1U;
+	if (!local_ports)
+		offset |= 1U;
 
 other_parity_scan:
 	port = low + offset;
-	for (i = 0; i < remaining; i += 2, port += 2) {
+	for (i = 0; i < remaining; i += step, port += step) {
 		if (unlikely(port >= high))
 			port -= remaining;
 		if (inet_is_local_reserved_port(net, port))
@@ -384,9 +388,11 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		cond_resched();
 	}
 
-	offset--;
-	if (!(offset & 1))
-		goto other_parity_scan;
+	if (!local_ports) {
+		offset--;
+		if (!(offset & 1))
+			goto other_parity_scan;
+	}
 
 	if (attempt_half == 1) {
 		/* OK we now try the upper half of the range */
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2] netfilter: nf_log: validate MAC header was set before dumping it
From: Alexander Martyniuk @ 2026-06-26  9:38 UTC (permalink / raw)
  To: gregkh
  Cc: alexevgmart, bestswngs, coreteam, davem, fw, kaber, kadlec, kuba,
	kuznet, linux-kernel, netdev, netfilter-devel, pablo, sashal,
	stable, xmei5, yoshfuji
In-Reply-To: <2026062658-pregame-buggy-ccbc@gregkh>

> What kernel(s) is this for?

5.10

^ permalink raw reply

* [PATCH net v2] net: libwx: fix VMDQ mask for 1-queue mode
From: Jiawen Wu @ 2026-06-26  9:25 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jacob Keller,
	Kees Cook, Jiawen Wu, Larysa Zaremba

In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
supporting WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
__ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
q_per_pool to 0 in wx_write_qde().

Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
~0x7F) correctly evaluates to 1.

Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
v2: Fix the commit message.
---
 drivers/net/ethernet/wangxun/libwx/wx_lib.c  | 1 +
 drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index d042567b8128..814d88d2aee4 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
 			rss_i = 4;
 		}
 	} else {
+		vmdq_m = WX_VMDQ_1Q_MASK;
 		/* double check we are limited to maximum pools */
 		vmdq_i = min_t(u16, 8, vmdq_i);
 
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
index c7befe4cdfe9..65e3e55db1cf 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
@@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
 
 #define WX_VMDQ_4Q_MASK              0x7C
 #define WX_VMDQ_2Q_MASK              0x7E
+#define WX_VMDQ_1Q_MASK              0x7F
 
 /****************** Manageablility Host Interface defines ********************/
 #define WX_HI_MAX_BLOCK_BYTE_LENGTH  256 /* Num of bytes in range */
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH] xsk: fix memory corruptions in net/core/xdp.c
From: Clement Lecigne @ 2026-06-26  9:12 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: edumazet, netdev, bpf, linux-kernel, kuba, sdf, horms,
	john.fastabend, ast, daniel
In-Reply-To: <0922ce5d-48d8-44e7-983c-e547f3126ef4@intel.com>

[-- Attachment #1: Type: text/plain, Size: 5631 bytes --]

Thanks, Olek!

I submitted a v2 of the patch directly as a reply to this thread by
mistake, do you still want me to post this new version separately?

On Thu, Jun 25, 2026 at 5:14 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Clement Lecigne <clecigne@google.com>
> Date: Wed, 24 Jun 2026 08:41:28 +0000
>
> > From: Clément Lecigne <clecigne@google.com>
> >
> > Commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
> > introduced a vulnerability in the handling of XDP_PASS for AF_XDP zero-copy
> > frames.
> >
> > Note: Currently, this specific AF_XDP zero-copy conversion path is only
> > reachable from the drivers/net/ethernet/intel/ice driver.
>
> idpf uses this, too (every driver based on libeth_xdp in general,
> currently these two).

Done.

>
> >
> > When building an skb, xdp_build_skb_from_zc() uses the chunk size
> > (xdp->frame_sz) for the allocation. However, napi_build_skb() automatically
> > reserves space at the end of the allocation for the skb_shared_info
> > structure.
> >
> > Most high performance UMEM applications use 4K chunks, where the
> > corruption cannot happen. However, if the UMEM is configured with 2KB
> > chunks (a very common configuration to maximize packet density in memory),
> > a standard 1500 MTU packet will trigger the corruption because the required
> > space exceeds the 2048 byte chunk size:
> >
> > Headroom (256) + Packet (1514) + skb_shared_info (320) = 2090 bytes
> >
> > Because 2090 bytes > 2048 bytes and __skb_put() does not perform bounds
> > checking, the memcpy() writes past the available linear data area and
> > corrupts the skb_shared_info structure. This can lead to arbitrary code
> > execution if pointers like destructor_arg are overwritten.
> >
> > Additionally, in xdp_copy_frags_from_zc(), the allocation size is set
> > strictly to the fragment size (len), but the subsequent memcpy() uses
> > LARGEST_ALIGN(len). This mismatch results in an out-of-bounds write of
> > up to 7 bytes, which triggers KASAN warnings and is unsafe despite typical
> > page pool allocator padding.
> >
> > Fix the skb allocation in xdp_build_skb_from_zc() by dynamically
> > calculating the exact truesize required: the sum of the headroom, the
> > packet length, and the skb_shared_info overhead, properly aligned via
> > SKB_DATA_ALIGN.
> >
> > Fix the out-of-bounds write in xdp_copy_frags_from_zc() by rounding up
> > the allocation request using LARGEST_ALIGN(len) to match the copy
> > operation.
> >
> > Fixes: 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
> > CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> > CC: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Clément Lecigne <clecigne@google.com>
> > ---
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 9890a30584ba..f36d1fb875ab 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -699,7 +699,7 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb,
> >       for (u32 i = 0; i < nr_frags; i++) {
> >               const skb_frag_t *frag = &xinfo->frags[i];
> >               u32 len = skb_frag_size(frag);
> > -             u32 offset, truesize = len;
> > +             u32 offset, truesize = LARGEST_ALIGN(len);
>
> I think you need to re-sort this to keep RCT, now that the truesize
> initialization is way longer than it was.

Done.

>
>                 const skb_frag_t *frag = &xinfo->frags[i];
>                 u32 offset, len = skb_frag_size(frag);
>                 u32 truesize = LARGEST_ALIGN(len);
>                 struct page *page;
>
> >               struct page *page;
> >
> >               page = page_pool_dev_alloc(pp, &offset, &truesize);
>
> BTW usually LARGEST_ALIGN() aligns to 16, I've never seen a bigger one.
> IIRC Page Pool never returns a truesize aligned to a smaller value. But
> if you're really able to trigger this, it probably does?
>
> > @@ -740,7 +740,9 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
> >  {
> >       const struct xdp_rxq_info *rxq = xdp->rxq;
> >       u32 len = xdp->data_end - xdp->data_meta;
> > -     u32 truesize = xdp->frame_sz;
> > +     u32 headroom = xdp->data_meta - xdp->data_hard_start;
> > +     u32 truesize = SKB_DATA_ALIGN(headroom + len) +
> > +                    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
> Ah now I get it: xdp->frame_sz doesn't account the shinfo for
> single-buffer frames, only for multi-buffer ones. The fix looks correct,
> but I'd use SKB_HEAD_ALIGN() since it does exactly what you're
> open-coding here and sort the declarations:

Good idea, done.


>
> {
>         u32 hr = xdp->data_meta - xdp->data_hard_start;
>         const struct xdp_rxq_info *rxq = xdp->rxq;
>         u32 len = xdp->data_end - xdp->data_meta;
>         u32 truesize = SKB_HEAD_ALIGN(hr + len);
>         struct sk_buff *skb = NULL;
>         struct page_pool *pp;
>         int metalen;
>         void *data;
>
>         if (!IS_ENABLED(CONFIG_PAGE_POOL))
>                 return NULL;
>
>         ...
>
> >       struct sk_buff *skb = NULL;
> >       struct page_pool *pp;
> >       int metalen;
> > @@ -762,7 +764,7 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
> >       }
> >
> >       skb_mark_for_recycle(skb);
> > -     skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
> > +     skb_reserve(skb, headroom);
> >
> >       memcpy(__skb_put(skb, len), xdp->data_meta, LARGEST_ALIGN(len));
>
> Thanks,
> Olek

Thanks,
-clem

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5281 bytes --]

^ permalink raw reply

* [PATCH net v2] nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
From: Samuel Page @ 2026-06-26  9:03 UTC (permalink / raw)
  To: David Heidelberg
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, oe-linux-nfc, netdev, linux-kernel, stable

nci_rf_discover_ntf_packet() and nci_rf_intf_activated_ntf_packet() each
parse a notification into an on-stack struct (nci_rf_discover_ntf /
nci_rf_intf_activated_ntf) that is not initialised. The RF
technology-specific parameters are only extracted when
rf_tech_specific_params_len is non-zero, so a notification that reports a
zero length leaves the rf_tech_specific_params union uninitialised - and
both handlers then pass it to nci_add_new_protocol(), which reads it:

 - discover:  nci_add_new_target() -> nci_add_new_protocol();
 - activated: nci_target_auto_activated() -> nci_add_new_protocol().

nci_add_new_protocol() uses nfca_poll->nfcid1_len as both a branch
condition and a memcpy() length and copies nfcid1/sens_res/sel_res into
ndev->targets, which is later exposed to user space via NFC_CMD_GET_TARGET.

  BUG: KMSAN: uninit-value in nci_add_new_protocol+0x624/0x6c0
   nci_add_new_protocol+0x624/0x6c0
   nci_ntf_packet+0x25b2/0x3c30
   nci_rx_work+0x318/0x5d0
   process_scheduled_works+0x84b/0x17a0
   worker_thread+0xc10/0x11b0
   kthread+0x376/0x500
  Local variable ntf.i created at:
   nci_ntf_packet+0xbc2/0x3c30

Zero-initialise both on-stack notifications so the union reads back as
zero when no technology-specific parameters are present.

Fixes: 019c4fbaa790 ("NFC: Add NCI multiple targets support")
Fixes: e8c0dacd9836 ("NFC: Update names and structs to NCI spec 1.0 d18")
Link: https://lore.kernel.org/netdev/20260623172109.1105965-2-horms@kernel.org/
Cc: stable@vger.kernel.org
Assisted-by: Bynario AI
Signed-off-by: Samuel Page <sam@bynar.io>
---
v2: Drop the inaccurate activation_params / NFC_ATTR_TARGET_ATS scenario
    from the commit message. No code change; the ntf = {} fix is unchanged.

 net/nfc/nci/ntf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/nfc/nci/ntf.c b/net/nfc/nci/ntf.c
index c96512bb8653..274d9a4202c9 100644
--- a/net/nfc/nci/ntf.c
+++ b/net/nfc/nci/ntf.c
@@ -440,7 +440,7 @@ void nci_clear_target_list(struct nci_dev *ndev)
 static int nci_rf_discover_ntf_packet(struct nci_dev *ndev,
 				      const struct sk_buff *skb)
 {
-	struct nci_rf_discover_ntf ntf;
+	struct nci_rf_discover_ntf ntf = {};
 	const __u8 *data;
 	bool add_target = true;
 
@@ -688,7 +688,7 @@ static int nci_rf_intf_activated_ntf_packet(struct nci_dev *ndev,
 					    const struct sk_buff *skb)
 {
 	struct nci_conn_info *conn_info;
-	struct nci_rf_intf_activated_ntf ntf;
+	struct nci_rf_intf_activated_ntf ntf = {};
 	const __u8 *data;
 	int err = NCI_STATUS_OK;
 

base-commit: 02f144fbb4c86c360495d33debe307cb46a57f95
-- 
2.54.0


^ permalink raw reply related

* [PATCH net-next] ipv4: fib: fix route re-dump in inet_dump_fib() on multi-batch dump
From: Pengfei Zhang @ 2026-06-26  8:56 UTC (permalink / raw)
  To: dsahern, idosch
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, zhangpengfei16, Pengfei Zhang

inet_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches, a concurrent
fib_new_table() can insert a new table at the chain head, shifting
all existing entries.  On resume the saved index lands on a different
table, causing already-dumped tables to be re-dumped and the
originally suspended table to restart from the beginning.

Fix by storing tb->tb_id in cb->args[1] instead of a positional
index, mirroring the fix applied to inet6_dump_fib().

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32")
Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
Consider a hash slot containing two tables [A(pos=0), B(pos=1)] where
B is large enough to require multiple batches.  On the first batch, B
suspends mid-walk and the loop saves:

  cb->args[1] = e;   /* e=1, position of B in the chain */

The lock is then released.  At this point a concurrent fib_new_table()
inserts table C at the chain head via hlist_add_head_rcu(), making the
chain [C(pos=0), A(pos=1), B(pos=2)].

On the next batch, inet_dump_fib() resumes with s_e=1 and iterates:

  s_e = cb->args[1];   /* s_e = 1 */
  hlist_for_each_entry_rcu(tb, head, tb_hlist) {
      if (e < s_e)     /* skip C at pos=0 */
          goto next;
      /* e=1: tb now points to A, not B */
      if (dumped)
          memset(...);  /* resets B's suspended progress */
      fib_table_dump(tb, ...);   /* re-dumps A from scratch */
      dumped = 1;
      /* e=2: tb now points to B */
      fib_table_dump(tb, ...);   /* re-dumps B from beginning */
  }

Routes from A are dumped twice, and the portion of B that was already
dumped in the first batch is dumped again.

 net/ipv4/fib_frontend.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 42212970d..65fa245af 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1019,10 +1019,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		.dump_routes = true,
 		.dump_exceptions = true,
 	};
-	unsigned int e = 0, s_e, h, s_h;
 	struct hlist_head *head;
 	int dumped = 0, err = 0;
+	unsigned int h, s_h;
 	struct fib_table *tb;
+	u32 s_id;
 
 	rcu_read_lock();
 	if (cb->strict_check) {
@@ -1054,29 +1055,28 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
 	err = 0;
-	for (h = s_h; h < FIB_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv4.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb_id != s_id)
+				continue;
+
+			s_id = 0;
 			if (dumped)
 				memset(&cb->args[2], 0, sizeof(cb->args) -
 						 2 * sizeof(cb->args[0]));
+			cb->args[1] = tb->tb_id;
 			err = fib_table_dump(tb, skb, cb, &filter);
 			if (err < 0)
 				goto out;
 			dumped = 1;
-next:
-			e++;
 		}
 	}
 out:
 
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1


^ permalink raw reply related

* [PATCH v2] xsk: fix memory corruptions in net/core/xdp.c
From: Clement Lecigne @ 2026-06-26  8:52 UTC (permalink / raw)
  To: aleksander.lobakin, edumazet, netdev
  Cc: clecigne, bpf, linux-kernel, kuba, sdf, horms, john.fastabend,
	ast, daniel
In-Reply-To: <0922ce5d-48d8-44e7-983c-e547f3126ef4@intel.com>

From: Clément Lecigne <clecigne@google.com>

Commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
introduced a vulnerability in the handling of XDP_PASS for AF_XDP zero-copy
frames.

Note: Currently, this specific AF_XDP zero-copy conversion path is only
reachable from the drivers/net/ethernet/intel/ice and
drivers/net/ethernet/intel/idpf drivers.

When building an skb, xdp_build_skb_from_zc() uses the chunk size
(xdp->frame_sz) for the allocation. However, napi_build_skb() automatically
reserves space at the end of the allocation for the skb_shared_info
structure. 

Most high performance UMEM applications use 4K chunks, where the
corruption cannot happen. However, if the UMEM is configured with 2KB
chunks (a very common configuration to maximize packet density in memory),
a standard 1500 MTU packet will trigger the corruption because the required
space exceeds the 2048 byte chunk size:

Headroom (256) + Packet (1514) + skb_shared_info (320) = 2090 bytes

Because 2090 bytes > 2048 bytes and __skb_put() does not perform bounds
checking, the memcpy() writes past the available linear data area and
corrupts the skb_shared_info structure. This can lead to arbitrary code
execution if pointers like destructor_arg are overwritten.

Additionally, in xdp_copy_frags_from_zc(), the allocation size is set
strictly to the fragment size (len), but the subsequent memcpy() uses
LARGEST_ALIGN(len). This mismatch results in an out-of-bounds write of
up to 7 bytes, which triggers KASAN warnings and is unsafe despite typical
page pool allocator padding.

Fix the skb allocation in xdp_build_skb_from_zc() by dynamically
calculating the exact truesize required using SKB_HEAD_ALIGN() to
properly account for the headroom, the packet length, and the
skb_shared_info overhead.

Fix the out-of-bounds write in xdp_copy_frags_from_zc() by rounding up
the allocation request using LARGEST_ALIGN(len) to match the copy
operation.

Fixes: 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Clément Lecigne <clecigne@google.com>
---
Changes since v1:
 - Used SKB_HEAD_ALIGN to properly calculate the required allocation size
   including the skb_shared_info overhead.
 - Re-ordered variable declarations.

---
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9890a30584ba..52546746378a 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -698,8 +698,8 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb,
 
 	for (u32 i = 0; i < nr_frags; i++) {
 		const skb_frag_t *frag = &xinfo->frags[i];
-		u32 len = skb_frag_size(frag);
-		u32 offset, truesize = len;
+		u32 offset, len = skb_frag_size(frag);
+		u32 truesize = LARGEST_ALIGN(len);
 		struct page *page;
 
 		page = page_pool_dev_alloc(pp, &offset, &truesize);
@@ -738,9 +738,10 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb,
  */
 struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 {
+	u32 headroom = xdp->data_meta - xdp->data_hard_start;
 	const struct xdp_rxq_info *rxq = xdp->rxq;
 	u32 len = xdp->data_end - xdp->data_meta;
-	u32 truesize = xdp->frame_sz;
+	u32 truesize = SKB_HEAD_ALIGN(headroom + len);
 	struct sk_buff *skb = NULL;
 	struct page_pool *pp;
 	int metalen;
@@ -762,7 +763,7 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 	}
 
 	skb_mark_for_recycle(skb);
-	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
+	skb_reserve(skb, headroom);
 
 	memcpy(__skb_put(skb, len), xdp->data_meta, LARGEST_ALIGN(len));
 

^ permalink raw reply related

* Re: [patch 09/24] timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id()
From: Thomas Weißschuh @ 2026-06-26  8:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound
In-Reply-To: <20260526171223.374814973@kernel.org>

On Tue, May 26, 2026 at 07:14:13PM +0200, Thomas Gleixner wrote:
(...)

>  static inline void tk_update_aux_offs(struct timekeeper *tk, ktime_t offs)
> @@ -1218,6 +1223,12 @@ bool ktime_get_snapshot_id(struct system
>  		tkd = &tk_core;
>  		offs = &tk_core.timekeeper.offs_boot;
>  		break;
> +	case CLOCK_AUX ... CLOCK_AUX_LAST:
> +		tkd = aux_get_tk_data(clock_id);
> +		if (!tkd)
> +			return false;
> +		offs = &tkd->timekeeper.offs_aux;
> +		break;

'tkd' is also used to compute 'monoraw'. However 'tkr_raw' and 'tkr_mono'
are the same for auxilary clocks, so this will compute a wrong 'monoraw'.
Instead 'monoraw' should be computed based on 'tk_core'.
Which then also requires the sequence locking of 'tk_core'.

As you know I have a series which unifies the locking between the
different timekeepers. Maybe we revert this patch for 7.2 and I send
a fixed variant including the prerequisites for 7.3.

(The same goes for get_device_system_crosststamp())

>  	default:
>  		WARN_ON_ONCE(1);
>  		return false;
> @@ -1228,6 +1239,10 @@ bool ktime_get_snapshot_id(struct system
>  	do {
>  		seq = read_seqcount_begin(&tkd->seq);
>  
> +		/* Aux clocks can be invalid */
> +		if (!tk->clock_valid)
> +			return false;
> +
>  		now = tk_clock_read(&tk->tkr_mono);
>  		systime_snapshot->cs_id = tk->tkr_mono.clock->id;
>  		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
> 

^ permalink raw reply

* [PATCH v2] Subject: [PATCH] net: gro: fix double aggregation of flush-marked skbs
From: Shiming Cheng @ 2026-06-26  8:44 UTC (permalink / raw)
  To: netdev, davem, edumazet, kuba, pabeni, horms, matthias.bgg,
	angelogioacchino.delregno, willemb, imv4bel, alice,
	eilaimemedsnaimel, sd
  Cc: lena.wang, stable, Shiming Cheng

The new skb_gro_receive_list() function is missing a critical safety check
present in the legacy skb_gro_receive() path. Specifically, it does not
validate NAPI_GRO_CB(skb)->flush before allowing packet aggregation.

This allows already-GRO'd packets with existing frag_list to be
re-aggregated into a new GRO session, corrupting the frag_list chain
structure. When skb_segment() attempts to unpack these malformed packets,
it encounters invalid state and triggers a kernel panic.

Scenario (Tethering/Device forwarding):
  1. Driver: Generated aggregated packet P1 via LRO with frag_list
  2. Dev A: Receives aggregated fraglist packet and flush flag set
  3. Dev A: Re-enters GRO, skb_gro_receive_list() is called
  4. Missing flush check allows re-aggregation despite flush flag
  5. Frag_list chain becomes corrupted (loops or dangling refs)
  6. Dev B: TX path calls skb_segment(), crashes on corrupted frag_list

Root cause in skb_segment():
  The check at line ~4891:
    if (hsize <= 0 && i >= nfrags && skb_headlen(list_skb) &&
        (skb_headlen(list_skb) == len || sg)) {

  When frag_list is corrupted by double aggregation, when list_skb is
  a NULL pointer from skb->next, skb_headlen(list_skb) dereference
  NULL/corrupted pointers occurs.

Call Trace:
 skb_headlen(NULL skb)
 skb_segment
 tcp_gso_segment
 tcp4_gso_segment
 inet_gso_segment
 skb_mac_gso_segment
 __skb_gso_segment
 skb_gso_segment
 validate_xmit_skb
 validate_xmit_skb_list
 sch_direct_xmit
 qdisc_restart
 __qdisc_run
 qdisc_run
 net_tx_action

Fix: Add NAPI_GRO_CB(skb)->flush validation to the early-return check in
skb_gro_receive_list(), matching the defensive programming pattern of
skb_gro_receive().

Fixes: 9dc2c3cd6c11 ("net: add fraglist GRO/GSO support")
Cc: stable@vger.kernel.org
Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com>
---
 net/core/gro.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index 35f2f708f010..076247c1e662 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -229,7 +229,8 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 
 int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb)
 {
-	if (unlikely(p->len + skb->len >= 65536))
+	if (unlikely(p->len + skb->len >= 65536 ||
+		     NAPI_GRO_CB(skb)->flush))
 		return -E2BIG;
 
 	if (!pskb_may_pull(skb, skb_gro_offset(skb))) {
-- 
2.45.2


^ permalink raw reply related

* [PATCH net] e1000e: fix IRQ leak when request_irq() fails in e1000_request_msix()
From: Jiayuan Chen @ 2026-06-26  8:39 UTC (permalink / raw)
  To: netdev
  Cc: Jiayuan Chen, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Garzik, Bruce Allan, intel-wired-lan, linux-kernel

An internal syzbot instance reported the warning below.

comedi (comedi_parport) lets userspace request_irq() an arbitrary IRQ
number and can thus grab one of e1000e's MSI-X vectors. When
e1000_request_msix() then fails partway through, it returned without
freeing the vectors it had already requested; pci_disable_msix() later
tears those descriptors down while their irqaction is still attached,
leaking the /proc/irq entry.

Free the already requested IRQs on the error path.

genirq: Flags mismatch irq 28. 00200000 (eth1-tx-0) vs. 00200000 (comedi_parport)

remove_proc_entry: removing non-empty directory 'irq/27', leaking at least 'eth1-rx-0'
WARNING: fs/proc/generic.c:742 at remove_proc_entry+0x436/0x560, CPU#3: ip/445
Modules linked in:
CPU: 3 UID: 0 PID: 445 Comm: ip Not tainted 7.1.0+ #284 PREEMPT
RIP: 0010:remove_proc_entry (fs/proc/generic.c:742 (discriminator 4))
PKRU: 55555554
Call Trace:
<TASK>
unregister_irq_proc (kernel/irq/proc.c:406)
free_desc (kernel/irq/irqdesc.c:482)
irq_free_descs (kernel/irq/irqdesc.c:874 kernel/irq/irqdesc.c:865)
irq_domain_free_irqs (kernel/irq/irqdomain.c:1917)
msi_domain_free_locked.part.0 (kernel/irq/msi.c:1619 kernel/irq/msi.c:1645)
msi_domain_free_irqs_all_locked (kernel/irq/msi.c:1632)
pci_msi_teardown_msi_irqs (drivers/pci/msi/irqdomain.c:28)
pci_free_msi_irqs (drivers/pci/msi/msi.c:925)
pci_disable_msix (drivers/pci/msi/api.c:200 drivers/pci/msi/api.c:193)
e1000_request_irq (drivers/net/ethernet/intel/e1000e/netdev.c:2028)
e1000e_open (drivers/net/ethernet/intel/e1000e/netdev.c:4681)
__dev_open (net/core/dev.c:1702)
netif_change_flags (net/core/dev.c:9806)
do_setlink.isra.0 (net/core/rtnetlink.c:3207 (discriminator 1))
rtnetlink_rcv_msg (net/core/rtnetlink.c:7068)
netlink_rcv_skb (net/netlink/af_netlink.c:2556)

Fixes: 4662e82b2cb4 ("e1000e: add support for new 82574L part")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Assisted-by: Claude:claude-opus-4-8
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 808e5cddd6a9..19b9823c5679 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -2099,7 +2099,7 @@ void e1000e_set_interrupt_capability(struct e1000_adapter *adapter)
 static int e1000_request_msix(struct e1000_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
-	int err = 0, vector = 0;
+	int err = 0, vector = 0, i;
 
 	if (strlen(netdev->name) < (IFNAMSIZ - 5))
 		snprintf(adapter->rx_ring->name,
@@ -2111,7 +2111,7 @@ static int e1000_request_msix(struct e1000_adapter *adapter)
 			  e1000_intr_msix_rx, 0, adapter->rx_ring->name,
 			  netdev);
 	if (err)
-		return err;
+		goto err_free;
 	adapter->rx_ring->itr_register = adapter->hw.hw_addr +
 	    E1000_EITR_82574(vector);
 	adapter->rx_ring->itr_val = adapter->itr;
@@ -2127,7 +2127,7 @@ static int e1000_request_msix(struct e1000_adapter *adapter)
 			  e1000_intr_msix_tx, 0, adapter->tx_ring->name,
 			  netdev);
 	if (err)
-		return err;
+		goto err_free;
 	adapter->tx_ring->itr_register = adapter->hw.hw_addr +
 	    E1000_EITR_82574(vector);
 	adapter->tx_ring->itr_val = adapter->itr;
@@ -2136,11 +2136,16 @@ static int e1000_request_msix(struct e1000_adapter *adapter)
 	err = request_irq(adapter->msix_entries[vector].vector,
 			  e1000_msix_other, 0, netdev->name, netdev);
 	if (err)
-		return err;
+		goto err_free;
 
 	e1000_configure_msix(adapter);
 
 	return 0;
+
+err_free:
+	for (i = vector - 1; i >= 0; i--)
+		free_irq(adapter->msix_entries[i].vector, netdev);
+	return err;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-26  8:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <58f37d6e-973b-4242-be82-0561ccdb1a6f@lunn.ch>


> Sphinx follows pythons object orientate structure. So you could have a
> class test_ethtool_pause_advertising, with class documentation. And
> then methods within the class which are individual tests.  The
> commented out section would then be method documentation.

Good point, so maybe something along these lines :

 - A class for the test
 - methods for indivitual tests
 - For readability, I've written what the internal test helper would look
   like (_adv_test), and how a test would look like without the helper in
   adv_rx_on_tx_on().

I'm already diving into coding, but it helps me a bit in the definition of the
"description" format :)

this is what the class would look like :


class test_ethtool_pause_advertising:
    """Pause advertisement

    Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
    translates to a change in the advertised pause params, and that these
    parameters are correct w.r.t the supported pause params and requested pause
    params.
    
    This exercises the .set_pauseparams() ethtool ops for MAC configuration,
    as well as the reconfiguration of the PHY's advertising and negociation.
    
    On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
    PHY's advertising, and restart a negotiation with phy_start_aneg() if
    need be. Failure to do so will result on the wrong advertising parameters.
    
    Pn phylink-enabled MACs, phylink deals with the PHY reconfiguration provided
    the MAC driver calls phylink_ethtool_set_pauseparam().
    
    Failing this test likely means that the PHY driver is not correctly advertising
    pause settings, either due to the MAC triggering a PHY reconfiguration,
    a misconficonfiguration of the advertising registers by the PHY, or by
    mis-handling the phydev->advertising bitfield in the PHY driver directly.
    
    The validation is made by looking at the advertised modes locally, as well as
    what the peer's 'lp_advertising' values report.

    cfg -- local device's interface configuration
    peer -- peer device handle
    """

    def _adv_test(cfg, peer, rx, tx, adv, not_adv):
        ret = cfg.run(f"ethtool -A ethX rx {rx} tx {tx} autoneg on")
        ksft_eq(ret, 0)

        linkmodes = cfg.get_advertising()
        if adv:
            ksft_in(adv, linkmodes, f"rx {rx} tx {tx} must advertise {adv}")

        if not_adv:
            ksft_not_in(not_adv, linkmodes, f"rx {rx} tx {tx} must not advertise {not_adv}")

        remote_linkmodes = peer.get_lp_advertising()

        if adv:
            ksft_in(adv, linkmodes, f"PHY does not advertise {adv}")

        if not_adv:
            ksft_not_in(not_adv, linkmodes, f"PHY incorrectly advertises {not_adv}")


    @ksft_ethtool_needs_supported_allof([Pause])
    def adv_rx_on_tx_on(cfg, peer) -> None:
        """Advertising test with rx on tx on

        - run 'ethtool -A ethX rx on tx on autoneg on'
        - FAIL if the return isn't 0
        - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
          "Pause" or contains "Asym_Pause"
        - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
          "Asym_Pause"
        - Succeed otherwise
        """
        ret = cfg.run('ethtool -A ethX rx on tx on autoneg on')
        ksft_eq(ret, 0)

        linkmodes = cfg.get_advertising()
        ksft_in('Pause', linkmodes, "rx on tx on must advertise Pause")
        ksft_not_in('Asym_Pause', linkmodes, "rx on tx on must not advertise Asym_Pause")

        remote_linkmodes = peer.get_lp_advertising()
        ksft_in('Pause', linkmodes, "PHY does not advertise Pause")
        ksft_not_in('Asym_Pause', linkmodes, "PHY incorrectly advertises Asym_Pause")


    @ksft_ethtool_needs_supported_allof([Pause, Asym_Pause])
    def adv_rx_on_tx_off(cfg, peer) -> None:
        """Advertising test with rx on tx off

        - run 'ethtool -A ethX rx on tx off autoneg on'
        - FAIL if the return isn't 0
        - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
          "Pause" and "Asym_Pause"
        - FAIL if peer's lp_advertising doesn't contain "Pause" and "Asym_Pause"
        - Succeed otherwise
        """

        _adv_test(cfg, peer, 'on', 'off', ["Pause", "Asym_Pause"], [])

    @ksft_ethtool_needs_supported_allof([Asym_Pause])
    def adv_rx_off_tx_on(cfg, peer) -> None:
        """Advertising test with rx off tx on

        - run 'ethtool -A ethX rx off tx on autoneg on'
        - FAIL if the return isn't 0
        - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
          "Asym_Pause" or contains "Pause"
        - FAIL if peer's lp_advertising doesn't contain "Pause" and "Asym_Pause"
        - Succeed otherwise
        """

        _adv_test(cfg, peer, 'off', 'on', ["Asym_Pause"], ["Pause"])


Maxime

^ permalink raw reply

* [PATCH net v2 1/1] net: sched: ets: avoid deficit wrap and bound empty dequeue  rounds
From: Ren Wei @ 2026-06-26  8:32 UTC (permalink / raw)
  To: netdev
  Cc: jhs, jiri, davem, petrm, yuantan098, yifanwucs, tomapufckgml,
	zcliangcn, bird, bronzed_45_vested, n05ec

From: Wyatt Feng <bronzed_45_vested@icloud.com>

ETS keeps each DRR-style deficit in a u32 and replenishes it with
the configured quantum whenever the head packet is too large. Both
the quantum and qdisc_pkt_len() are user-controlled inputs: a large
quantum can wrap the deficit counter, while a tiny quantum combined
with an inflated qdisc_pkt_len() can force billions of iterations in
softirq context before any packet becomes eligible.

Store the deficit in u64 so replenishment cannot wrap the counter.
This keeps the existing dequeue logic unchanged while fixing the
overflow condition.

Bound one dequeue attempt to at most nbands * 2 ETS rotations, as
suggested in review. This avoids the livelock without adding heavier
logic to the fast path.

Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
changes in v2:
  - Instead of doing a div() in the fast path, simply bound the loop per
    dequeue
  - v1 Link: https://lore.kernel.org/all/20260615103759.2404228-2-n05ec@lzu.edu.cn/


 net/sched/sch_ets.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index cb8cf437ce87..12a156ccb0a6 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -40,7 +40,7 @@ struct ets_class {
 	struct list_head alist; /* In struct ets_sched.active. */
 	struct Qdisc *qdisc;
 	u32 quantum;
-	u32 deficit;
+	u64 deficit;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue qstats;
 };
@@ -463,6 +463,8 @@ ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
 static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
 {
 	struct ets_sched *q = qdisc_priv(sch);
+	unsigned int max_loops = READ_ONCE(q->nbands) * 2;
+	unsigned int loops = 0;
 	struct ets_class *cl;
 	struct sk_buff *skb;
 	unsigned int band;
@@ -499,6 +501,8 @@ static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
 
 		cl->deficit += READ_ONCE(cl->quantum);
 		list_move_tail(&cl->alist, &q->active);
+		if (++loops > max_loops)
+			goto out;
 	}
 out:
 	return NULL;
-- 
2.47.3


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox