* Re: [PATCH RESEND net-next] net: airoha: Make use of the helper function dev_err_probe()
From: Lorenzo Bianconi @ 2026-06-30 10:38 UTC
To: Lei Zhu; +Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev
In-Reply-To: <20260630015247.43322-1-zhulei_szu@163.com>
> From: Lei Zhu <zhulei@kylinos.cn>
>
> Use dev_err_probe() to reduce code size and simplify the code.
>
> Signed-off-by: Lei Zhu <zhulei@kylinos.cn>
> ---
> The last submission was sent while net-next was closed. Resending it.
>
> drivers/net/ethernet/airoha/airoha_eth.c | 21 +++++++++------------
> 1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 31cdb11cd78d..189f64e83a46 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -3071,10 +3071,9 @@ static int airoha_probe(struct platform_device *pdev)
> eth->dev = &pdev->dev;
>
> err = dma_set_mask_and_coherent(eth->dev, DMA_BIT_MASK(32));
I do not think dma_set_mask_and_coherent() can return -EPROBE_DEFER, so there
is no point adding dev_err_probe() here.
Regards,
Lorenzo
> - if (err) {
> - dev_err(eth->dev, "failed configuring DMA mask\n");
> - return err;
> - }
> + if (err)
> + return dev_err_probe(eth->dev, err,
> + "failed configuring DMA mask\n");
>
> eth->fe_regs = devm_platform_ioremap_resource_byname(pdev, "fe");
> if (IS_ERR(eth->fe_regs))
> @@ -3087,10 +3086,9 @@ static int airoha_probe(struct platform_device *pdev)
> err = devm_reset_control_bulk_get_exclusive(eth->dev,
> ARRAY_SIZE(eth->rsts),
> eth->rsts);
> - if (err) {
> - dev_err(eth->dev, "failed to get bulk reset lines\n");
> - return err;
> - }
> + if (err)
> + return dev_err_probe(eth->dev, err,
> + "failed to get bulk reset lines\n");
>
> xsi_rsts = devm_kcalloc(eth->dev,
> eth->soc->num_xsi_rsts, sizeof(*xsi_rsts),
> @@ -3105,10 +3103,9 @@ static int airoha_probe(struct platform_device *pdev)
> err = devm_reset_control_bulk_get_exclusive(eth->dev,
> eth->soc->num_xsi_rsts,
> eth->xsi_rsts);
> - if (err) {
> - dev_err(eth->dev, "failed to get bulk xsi reset lines\n");
> - return err;
> - }
> + if (err)
> + return dev_err_probe(eth->dev, err,
> + "failed to get bulk xsi reset lines\n");
>
> eth->napi_dev = alloc_netdev_dummy(0);
> if (!eth->napi_dev)
> --
> 2.25.1
>
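To illustrate the point about -EPROBE_DEFER raised above, here is a minimal userspace sketch of what dev_err_probe() adds over a plain dev_err() plus return. It is a simplified model, not the kernel implementation: mock_dev_err_probe() and the two counters are hypothetical names, and the real helper does more (for example, it stores the deferral reason on the device). When the error can never be -EPROBE_DEFER, only the logging branch is ever taken, so the conversion is purely cosmetic.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal errno, absent from userspace headers */
#endif

/* Hypothetical counters so the two branches can be observed. */
int mock_err_logs;
int mock_deferred_notes;

/* Simplified stand-in for dev_err_probe(): always returns err unchanged. */
int mock_dev_err_probe(int err, const char *fmt, ...)
{
	va_list ap;

	if (err == -EPROBE_DEFER) {
		/* Deferred probe: note the reason quietly, no error log. */
		mock_deferred_notes++;
		return err;
	}

	/* Any other error: log it, exactly like dev_err() would. */
	mock_err_logs++;
	fprintf(stderr, "error %d: ", err);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	return err;
}
```

In this model a call such as mock_dev_err_probe(-EIO, "failed configuring DMA mask\n") logs and returns -EIO, which is what the original dev_err() and return pair already did.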
* Re: [PATCH v4] net: stmmac: fix fatal bus error on resume by reinitializing RX buffers
From: Paolo Abeni @ 2026-06-30 10:55 UTC
To: Ding Hui, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Maxime Coquelin, Alexandre Torgue,
Russell King (Oracle), Maxime Chevallier, Ding Hui,
open list:STMMAC ETHERNET DRIVER,
moderated list:ARM/STM32 ARCHITECTURE,
moderated list:ARM/STM32 ARCHITECTURE, open list
Cc: xiasanbo, yangchen11, liuxuanjun
In-Reply-To: <20260627122533.1165324-1-dinghui1111@163.com>
On 6/27/26 2:25 PM, Ding Hui wrote:
> From: Ding Hui <dinghui@lixiang.com>
>
> On suspend, stmmac_suspend() calls stmmac_disable_all_queues() which
> stops the RX NAPI, but the RX DMA engine may still be running for a
> short window before stmmac_stop_all_dma() takes effect. During that
> window the hardware can write incoming frames into the buffers pointed
> to by the RX descriptors and write back the descriptors (clearing the
> OWN bit and overwriting RDES0/1/2 with status/timestamp data). Because
> NAPI is already disabled, the driver never refills these descriptors,
> so the RX ring is left in a "consumed but not refilled" state with
> stale content in the descriptor buffer-address fields.
>
> On resume, stmmac_clear_descriptors() only re-arms the OWN bit and
> does not repopulate the RX buffer address fields. When the DMA is
> restarted it dereferences these stale addresses and triggers a fatal
> bus error (not kernel panic, just a Fatal Bus Error interrupt and
> RX DMA engine halts).
>
> Fix this by introducing stmmac_reinit_rx_descriptors(), called from
> stmmac_resume() immediately after stmmac_clear_descriptors(). The
> helper iterates every RX descriptor slot and re-programs its buffer
> address fields:
>
> - For normal (page_pool) queues: restore RDES0/1 from buf->addr and
> RDES2 from buf->sec_addr. The DMA mapping has remained valid across
> suspend/resume because no pages were freed. Slots left NULL by a
> prior GFP_ATOMIC failure in stmmac_rx_refill() before suspend
> are re-allocated here with GFP_KERNEL;
> -ENOMEM is returned and resume is aborted if allocation fails.
> The slots with null buffer are unacceptable, because they will
> cause a DMA suspend dead lock problem by the condition of
> Current Descriptor Pointer == Descriptor Tail Pointer.
>
> - For AF_XDP zero-copy queues: restore the DMA address from
> xsk_buff_xdp_get_dma(buf->xdp). Slots with no xdp buffer
> (e.g. TX-only socket, empty fill ring) attempt xsk_buff_alloc()
> first; on failure the descriptor is zeroed so the DMA engine skips
> the slot safely via an RBU event.
>
> - For chain mode: call stmmac_mode_init() to rebuild the des3 next-
> descriptor pointer chain, which hardware may have overwritten with
> a PTP timestamp value (as noted in chain_mode.c:refill_desc3()).
>
> After reprogramming all address fields, a final pass restores OWN=1
> on every valid slot. This is necessary because set_sec_addr and
> chain-mode init unconditionally overwrite des3 (clearing the OWN bit
> set by stmmac_clear_descriptors()), and must run after all address
> writes are complete.
>
> Also fix stmmac_init_rx_buffers() to actually use its gfp_t flags
> parameter instead of the hardcoded GFP_ATOMIC | __GFP_NOWARN.
>
> Signed-off-by: Ding Hui <dinghui@lixiang.com>
This looks like 'net' material; it should specify 'net' in the subject
prefix and include a suitable Fixes tag.
> ---
> Changes in v4:
> - Just add description for return value of 'stmmac_reinit_rx_descriptors'.
> - Link to v3:
> https://lore.kernel.org/netdev/20260604144557.3175399-1-dinghui1111@163.com/
>
> Changes in v3:
> - Re-allocate page_pool NULL slots (from prior GFP_ATOMIC failures)
> with GFP_KERNEL in stmmac_reinit_rx_descriptors(); return -ENOMEM and
> abort resume.
> - For XSK NULL slots, attempt xsk_buff_alloc() first; fall back to
> stmmac_clear_desc() only when allocation fails.
> - Add a re-arm loop at the end of stmmac_reinit_rx_descriptors() to
> restore OWN=1 on all valid slots, since set_sec_addr and
> chain-mode init both write des3 unconditionally.
> - stmmac_reinit_rx_descriptors() now returns int; stmmac_resume()
> checks the return value and propagates -ENOMEM with mutex/rtnl cleanup.
> - Fix stmmac_init_rx_buffers() to use its flags parameter instead of
> hardcoded GFP_ATOMIC | __GFP_NOWARN.
> (884d2b845477 ("net: stmmac: Add GFP_DMA32 for rx buffers if no 64
> capability"))
> - Run stmmac_reinit_rx_descriptors() after stmmac_clear_descriptors()
> so that stmmac_clear_desc() on XSK NULL slots overrides the OWN
> bit set by stmmac_clear_descriptors().
> - Update commit message.
> - Link to v2:
> https://lore.kernel.org/netdev/20260526022620.501229-1-dinghui1111@163.com/
>
> Changes in v2:
> - Introduce stmmac_reinit_rx_descriptors() to reinitialize RX
> buffers without any allocation.
> - Modify commit log.
> - Link to v1:
> https://lore.kernel.org/netdev/20260515053856.2310369-1-dinghui1111@163.com/
> ---
> .../net/ethernet/stmicro/stmmac/stmmac_main.c | 164 +++++++++++++++++-
> 1 file changed, 163 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 3591755ea30b..c82f3d5dbd43 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1660,7 +1660,7 @@ static int stmmac_init_rx_buffers(struct stmmac_priv *priv,
> {
> struct stmmac_rx_queue *rx_q = &dma_conf->rx_queue[queue];
> struct stmmac_rx_buffer *buf = &rx_q->buf_pool[i];
> - gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
> + gfp_t gfp = flags;
The above should go via a separate (net) patch.
>
> if (priv->dma_cap.host_dma_width <= 32)
> gfp |= GFP_DMA32;
> @@ -1693,6 +1693,148 @@ static int stmmac_init_rx_buffers(struct stmmac_priv *priv,
> return 0;
> }
>
> +/**
> + * stmmac_reinit_rx_descriptors - re-program RX descriptor buffer addresses
> + * after stmmac_clear_descriptors()
> + * @priv: driver private structure
> + * @dma_conf: structure holding the dma data
> + * @queue: RX queue index
> + *
> + * Description: Called in the resume path after stmmac_clear_descriptors()
> + * has re-armed the OWN bit on every descriptor. Walk buf_pool[] and
> + * re-program the buffer-address fields of every RX descriptor from the
> + * buffers that are already attached to the queue. Slots whose page was
> + * never allocated (GFP_ATOMIC failure before suspend) are re-allocated
> + * here with GFP_KERNEL; the resume path is in process context.
> + *
> + * Between suspend and resume the hardware may have written back status/
> + * length information into the descriptor address fields (RDESx are reused
> + * for status on completion for GMAC4/XGMAC), so the address fields must be
> + * repopulated before the DMA is restarted.
> + *
> + * For XSK slots that have no xdp buffer at suspend time (TX-only socket,
> + * empty fill ring for Rx), xsk_buff_alloc() is attempted but does not
> + * return an error on failure because we can't identify a real TX-only
> + * socket from an alloc error (same as stmmac_alloc_rx_buffers_zc() in
> + * __init_dma_rx_desc_rings); on failure the descriptor is zeroed so the DMA
> + * engine skips the slot safely.
> + *
> + * To avoid the DMA stall after resume in non-XSK mode, this function
> + * re-allocates pages for NULL slots using GFP_KERNEL (the resume path runs
> + * in process context). If allocation fails, -%ENOMEM is returned immediately
> + * and the resume is aborted; the caller should report the error.
> + *
> + * This helper must be called after stmmac_clear_descriptors() and before
> + * stmmac_hw_setup() in stmmac_resume() because we need to wipe the OWN bit
> + * set in stmmac_clear_descriptors() for NULL slots in XSK mode.
Please try to condense the above text into one or two paragraphs.
> + *
> + * Returns: 0 on success, or a negative errno on allocation failure in
> + * non-XSK mode (e.g. -%ENOMEM).
> + */
> +static int stmmac_reinit_rx_descriptors(struct stmmac_priv *priv,
> + struct stmmac_dma_conf *dma_conf,
> + u32 queue)
> +{
> + struct stmmac_rx_queue *rx_q = &dma_conf->rx_queue[queue];
> + struct stmmac_rx_buffer *buf;
> + struct dma_desc *p;
> + int i;
> +
> + if (rx_q->xsk_pool) {
> + for (i = 0; i < dma_conf->dma_rx_size; i++) {
> + buf = &rx_q->buf_pool[i];
> + p = stmmac_get_rx_desc(priv, rx_q, i);
> +
> + /* The XSK pool may not be fully populated (e.g.
> + * xdpsock TX-only, empty fill ring). Try to refill
> + * from the pool; on failure zero the descriptor so the
> + * DMA engine skips this slot safely.
> + */
> + if (!buf->xdp) {
> + buf->xdp = xsk_buff_alloc(rx_q->xsk_pool);
> + if (!buf->xdp) {
> + stmmac_clear_desc(priv, p);
> + continue;
> + }
> + }
> +
> + stmmac_set_desc_addr(priv, p,
> + xsk_buff_xdp_get_dma(buf->xdp));
> + stmmac_set_desc_sec_addr(priv, p, 0, false);
> + }
> + } else {
> + for (i = 0; i < dma_conf->dma_rx_size; i++) {
> + buf = &rx_q->buf_pool[i];
> + p = stmmac_get_rx_desc(priv, rx_q, i);
> +
> + /* buf->page can be NULL when stmmac_rx_refill() hit a
> + * GFP_ATOMIC failure before suspend and left the slot
> + * without a buffer. The resume path runs in process
> + * context, so re-allocate with GFP_KERNEL. Allocation
> + * failure aborts the resume.
> + */
> + if (!buf->page) {
> + int err;
> +
> + err = stmmac_init_rx_buffers(priv, dma_conf, p,
> + i, GFP_KERNEL,
> + queue);
> + if (err)
> + return err;
> + /* stmmac_init_rx_buffers() already programmed
> + * the descriptor; skip the reprogramming below.
> + */
> + continue;
> + }
> +
> + stmmac_set_desc_addr(priv, p, buf->addr);
> + stmmac_set_desc_sec_addr(priv, p, buf->sec_addr,
> + priv->sph_active &&
> + buf->sec_page);
AFAICS stmmac_rx_refill() can error out after successfully allocating
buf->page while leaving sec_page NULL. I think you should try to
reallocate the latter as well.
Finally, this chunk shares quite a bit of code with stmmac_rx_refill()
and stmmac_rx_refill_zc(); it would be better to factor out common
helpers.
/P
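The review point about a missing sec_page can be sketched with a small userspace model of the RX ring. This is only an illustration under stated assumptions, not the stmmac code: all mock_* names are hypothetical, addresses are 32-bit, an address of 0 stands for "no buffer attached", the OWN bit is assumed to sit in bit 31 of des3, and the allocator never fails. It follows the commit message in that address writes clobber des3, so OWN is restored in a final pass.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_OWN	(1u << 31)	/* assumed position of the OWN bit in des3 */

struct mock_desc {
	uint32_t des0, des1, des2, des3;
};

struct mock_rx_buf {
	uint32_t addr;		/* 0 means "no page attached" in this model */
	uint32_t sec_addr;	/* 0 means "no secondary page attached" */
};

/* Hypothetical allocator: hands out fake DMA addresses, never fails. */
uint32_t mock_next_addr = 0x1000;

uint32_t mock_alloc(void)
{
	uint32_t a = mock_next_addr;

	mock_next_addr += 0x1000;
	return a;
}

/*
 * Re-program every slot, allocating whichever buffer is missing, then
 * hand all descriptors back to the hardware in a final pass.
 */
void mock_reinit_rx(struct mock_desc *ring, struct mock_rx_buf *bufs,
		    size_t n, bool sph_active)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!bufs[i].addr)
			bufs[i].addr = mock_alloc();
		/* The case raised in review: page present, sec_page missing. */
		if (sph_active && !bufs[i].sec_addr)
			bufs[i].sec_addr = mock_alloc();

		ring[i].des0 = bufs[i].addr;
		ring[i].des1 = 0;
		ring[i].des2 = sph_active ? bufs[i].sec_addr : 0;
		ring[i].des3 = 0;	/* models the address writes clearing OWN */
	}

	/* Final pass: restore OWN only after all addresses are written. */
	for (i = 0; i < n; i++)
		ring[i].des3 |= MOCK_OWN;
}
```

Handling both missing-buffer cases in one place is also the kind of shared helper that the refill paths could reuse.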
* [PATCH net-next v4 0/2] net: pse-pd: add Realtek/Broadcom PSE MCU support
From: Jonas Jelonek @ 2026-06-30 10:56 UTC
To: Oleksij Rempel, Kory Maincent, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
Krzysztof Kozlowski, Conor Dooley
Cc: netdev, devicetree, linux-kernel, Daniel Golle, Bjørn Mork,
Jonas Jelonek
This series adds a PSE-PD driver for the microcontroller (MCU) that
fronts the PSE silicon on a range of managed switches, together with its
DT binding.
Hardware model
==============
These boards do not expose the PSE chips to the host directly. A small
microcontroller sits on an I2C/SMBus or UART bus and manages one or more
PSE chips behind it; the host CPU only ever talks to that MCU, using a
fixed 12-byte request/response protocol with a trailing checksum. The
PSE silicon never appears on the bus.
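As a rough illustration of the 12-byte framing described above, the following userspace sketch builds and checks a frame. It assumes the layout given in the driver header comment (opcode, arg, 9 payload bytes, checksum) and the sum-mod-256 checksum mentioned there. The mock_* names and the 0xff padding of unused payload bytes are assumptions made for this example, not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FRAME_LEN		12
#define MOCK_PAYLOAD_LEN	9

/* Sum of the first len bytes, modulo 256 (uint8_t wraps naturally). */
uint8_t mock_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	while (len--)
		sum += *buf++;
	return sum;
}

/* Build a request frame; returns -1 if the payload does not fit. */
int mock_build_frame(uint8_t frame[MOCK_FRAME_LEN], uint8_t opcode,
		     uint8_t arg, const uint8_t *payload, size_t payload_len)
{
	if (payload_len > MOCK_PAYLOAD_LEN)
		return -1;

	frame[0] = opcode;
	frame[1] = arg;
	memset(&frame[2], 0xff, MOCK_PAYLOAD_LEN);	/* assumed padding value */
	if (payload_len)
		memcpy(&frame[2], payload, payload_len);
	frame[MOCK_FRAME_LEN - 1] = mock_checksum(frame, MOCK_FRAME_LEN - 1);
	return 0;
}

/* A frame is well-formed if its last byte matches the checksum. */
bool mock_frame_valid(const uint8_t frame[MOCK_FRAME_LEN])
{
	return frame[MOCK_FRAME_LEN - 1] ==
	       mock_checksum(frame, MOCK_FRAME_LEN - 1);
}
```

Note that an all-zero frame passes a sum-mod-256 check, which is consistent with the v3 change that additionally rejects zeroed responses via the echoed seq_num.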
The same protocol family is used by MCUs fronting Realtek PSE chips
(RTL8238B, RTL8239, RTL8239C) and Broadcom PSE chips (BCM59111,
BCM59121), diverging in opcode numbering and a few response layouts. The
driver abstracts that behind a per-dialect opcode table and parser hooks,
selected by the compatible. The specific PSE chip behind the MCU is
detected at runtime and only influences per-chip constants (power scaling
and the per-port cap).
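The per-dialect opcode table can be pictured roughly as follows. This is a hedged userspace sketch, not the driver's data structures: the mock_* names are hypothetical, only two commands are modelled, and the opcode values are placeholders chosen for the example. Only the two compatible strings come from this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Abstract commands the core issues, independent of the dialect. */
enum mock_cmd {
	MOCK_CMD_PORT_ENABLE,
	MOCK_CMD_PORT_GET_STATUS,
	MOCK_CMD_COUNT,
};

struct mock_dialect {
	const char *compatible;
	uint8_t opcode[MOCK_CMD_COUNT];	/* wire opcode per abstract command */
};

/* Opcode values below are placeholders, not the real protocol numbers. */
static const struct mock_dialect mock_dialects[] = {
	{
		.compatible = "realtek,pse-mcu-rtk",
		.opcode = {
			[MOCK_CMD_PORT_ENABLE] = 0x01,
			[MOCK_CMD_PORT_GET_STATUS] = 0x02,
		},
	},
	{
		.compatible = "realtek,pse-mcu-brcm",
		.opcode = {
			[MOCK_CMD_PORT_ENABLE] = 0x11,
			[MOCK_CMD_PORT_GET_STATUS] = 0x12,
		},
	},
};

/* Select the dialect from the compatible string; NULL if unknown. */
const struct mock_dialect *mock_dialect_lookup(const char *compatible)
{
	size_t i;

	for (i = 0; i < sizeof(mock_dialects) / sizeof(mock_dialects[0]); i++)
		if (!strcmp(mock_dialects[i].compatible, compatible))
			return &mock_dialects[i];
	return NULL;
}
```

With a table like this the core code only ever names abstract commands, and the dialect chosen at probe time decides which byte goes on the wire.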
Why the compatible names the protocol, not the chip
===================================================
The compatibles are "realtek,pse-mcu-rtk" and "realtek,pse-mcu-brcm".
This is a deliberate choice and the part most likely to raise questions,
so here is the reasoning up front.
The node names the protocol dialect, not a part:
- The DT node describes the MCU, not a PSE chip: the PSE chips are
behind the MCU and never appear on the bus, so naming the node after
one (e.g. "realtek,rtl8239") would describe hardware that isn't at
that address.
- The PSE chips are, in principle, usable without this MCU (host-driven
directly) - different hardware with a different programming model
that would warrant its own binding. Claiming the PSE-chip compatibles
here would collide with that.
- Naming the MCU silicon is equally wrong: these are ordinary
general-purpose microcontrollers (GigaDevice, Nuvoton, ...) that vary
across boards and are not dedicated to this application.
- What is fixed, and all the driver needs at DT-parse time, is the
protocol dialect, so the compatible encodes exactly that. The
"realtek" prefix names the owner of the protocol the MCU runs -
Realtek documents it and supplies the firmware - following the
"google,cros-ec-*" pattern (the prefix is the protocol/firmware owner,
not the varying controller silicon). The "-rtk"/"-brcm" suffix selects
the Realtek or Broadcom dialect; the specific PSE chip behind the MCU
is detected at runtime.
One compatible per dialect spans both transports:
- The 12-byte wire protocol is identical over I2C/SMBus and UART; only
the plumbing differs (SMBus vs native framing on I2C, baud rate on
UART), and the transport is already expressed structurally by the
node's parent bus (i2c@... vs serial@...). A "-i2c"/"-uart" suffix
would only duplicate that, for a protocol that does not change across
transports.
- This is the multi-transport model used by e.g. "bosch,bmi160" (one
compatible, separate i2c and spi drivers binding it), rather than the
cros-ec model of per-transport compatibles - cros-ec splits because
its on-wire framing genuinely differs per bus, which is not the case
here.
The binding documents both points as well.
Testing
=======
- Linksys LGS328MPCv2 (RTL8238B, I2C)
- Zyxel GS1900-10HP A1 (BCM59121, UART)
- Zyxel GS1900-10HP B1 (RTL8238B, UART)
- Zyxel GS1920-24HPv2 (BCM59121, SMBus)
- Zyxel XMG1915-10EP (RTL8239C, UART)
- Zyxel XS1930-12HP (RTL8239, SMBus)
---
v3 -> v4:
- move owner setting from core to transport, mitigating possible
use-after-free (Sashiko)
- resend because net-next was still closed
v3: https://lore.kernel.org/all/20260628222705.4052815-1-jelonek.jonas@gmail.com/
v2 -> v3:
- dt-bindings: use brcm instead of bcm for Broadcom
- rename the driver files and Kconfig symbols to realtek-pse-mcu-* /
PSE_REALTEK_MCU* for consistency with the realtek,pse-mcu-* compatibles
- rename driver-internal prefix from 'rtpse_' to 'rtpse_mcu' to
emphasize this targets the MCU-centric setup (and leaves room open
for eventual directly addressable PSE chips)
- rework the vendor-prefix rationale (binding + commit message): the
prefix names the protocol/firmware owner (Realtek documents the protocol
and supplies the firmware), and -rtk/-brcm select the Realtek or Broadcom
protocol dialect
- core: reject zeroed/echo-mismatched responses via the echoed seq_num
(a BCM PORT_ENABLE on port 0 was otherwise accepted from an all-zero
frame)
- core: enable the PoE supply before global-enabling the MCU, and roll
back the global enable on probe failure or driver removal
- core: drop inline from helpers (flagged by automated check)
- uart: update the completion under rx_lock too, so a late frame can no
longer make the next transaction fail spuriously with -EIO
v2: https://lore.kernel.org/netdev/20260612132944.460646-1-jelonek.jonas@gmail.com/
v1 -> v2:
- all points flagged by Sashiko addressed:
- uart: drop frame overflow (return count, not the stored length) so
serdev retains no leftover bytes that would misalign the next response
- uart: guard rx_buf/rx_len with a spinlock to close a data race between
the async receive_buf callback and send/recv
- i2c: return terminal MCU error opcodes (0xfd/0xfe) to the core
immediately instead of polling to the 1 s timeout
- core: cap BCM59121 at 30 W (802.3at) — the basic 8-bit set command
can't program the advertised 60 W (it silently clamped to 51 W)
v1: https://lore.kernel.org/netdev/20260608205758.1830521-1-jelonek.jonas@gmail.com/
---
Jonas Jelonek (2):
dt-bindings: net: pse-pd: add bindings for Realtek/Broadcom PSE MCU
net: pse-pd: add Realtek/Broadcom PSE MCU driver
.../bindings/net/pse-pd/realtek,pse-mcu.yaml | 154 +++
MAINTAINERS | 7 +
drivers/net/pse-pd/Kconfig | 28 +
drivers/net/pse-pd/Makefile | 3 +
drivers/net/pse-pd/realtek-pse-mcu-core.c | 1019 +++++++++++++++++
drivers/net/pse-pd/realtek-pse-mcu-i2c.c | 163 +++
drivers/net/pse-pd/realtek-pse-mcu-uart.c | 156 +++
drivers/net/pse-pd/realtek-pse-mcu.h | 87 ++
8 files changed, 1617 insertions(+)
create mode 100644 Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-core.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-i2c.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-uart.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu.h
base-commit: cef9d6804030793cf8b8796fd6936197d065dd3e
--
2.51.0
* [PATCH net-next v4 1/2] dt-bindings: net: pse-pd: add bindings for Realtek/Broadcom PSE MCU
From: Jonas Jelonek @ 2026-06-30 10:56 UTC
To: Oleksij Rempel, Kory Maincent, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
Krzysztof Kozlowski, Conor Dooley
Cc: netdev, devicetree, linux-kernel, Daniel Golle, Bjørn Mork,
Jonas Jelonek
In-Reply-To: <20260630105651.756058-1-jelonek.jonas@gmail.com>
Add a binding for the microcontroller (MCU) that fronts the PSE silicon
on a range of managed switches. The host talks only to the MCU, over
I2C/SMBus or UART, using a fixed message-based protocol; the PSE chips
behind it never appear on the bus.
The compatible names the MCU front-end, not a specific part. These
boards front the PSE silicon with an MCU that presents a stable
message protocol Realtek documents. The PSE chip behind it varies
- Broadcom on older boards, Realtek on newer - and is detected at
runtime; the arrangement appears to be a Realtek MCU-based PoE design
carried across those PSE-chip generations. So the 'realtek' prefix
names that front-end (Realtek's protocol and firmware), not the
general-purpose MCU silicon or the PSE chip - the google,cros-ec-*
model. The '-rtk'/'-brcm' suffix selects the Realtek or Broadcom dialect.
A single compatible per dialect covers both the I2C/SMBus and UART
attachments: the wire protocol is identical across them and the transport
is expressed by the node's parent bus, so it is not encoded in the
compatible.
Both dialects share one protocol family and one device tree contract, so
they are documented in a single binding under the one 'realtek' prefix,
with the '-rtk'/'-brcm' suffix distinguishing the dialect.
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
---
.../bindings/net/pse-pd/realtek,pse-mcu.yaml | 154 ++++++++++++++++++
1 file changed, 154 insertions(+)
create mode 100644 Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml
diff --git a/Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml b/Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml
new file mode 100644
index 000000000000..3a414ef38922
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml
@@ -0,0 +1,154 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/pse-pd/realtek,pse-mcu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Realtek/Broadcom PSE MCU
+
+maintainers:
+ - Jonas Jelonek <jelonek.jonas@gmail.com>
+
+description: |
+ Microcontroller (MCU) that fronts the PSE hardware on switches using
+ Realtek (RTL8238B, RTL8239, RTL8239C) or Broadcom (BCM59111, BCM59121)
+ PSE chips. The MCU exposes a small message-based protocol over either
+ I2C/SMBus or UART; the actual PSE silicon is not accessed directly. The
+ Realtek and Broadcom variants share this device tree contract but use
+ different protocol opcodes, selected by the compatible.
+
+ The compatible identifies the PSE-MCU protocol dialect, not a specific
+ part. The device here is the MCU: it presents a stable message protocol
+ documented by Realtek, with the PSE silicon behind it - Broadcom on
+ older boards, Realtek on newer - detected at runtime and not described
+ here. The MCU's own silicon is general-purpose and varies across
+ boards, so the 'realtek' vendor prefix names the protocol front-end
+ (following the google,cros-ec pattern); the '-rtk'/'-brcm' suffix
+ selects the Realtek or Broadcom dialect.
+
+ A single compatible per dialect covers both the I2C/SMBus and UART
+ attachments: the wire protocol is identical across them and the
+ transport is already expressed by the node's parent bus, so it is not
+ encoded in the compatible. Transport-specific properties differ
+ accordingly - the I2C attachment carries 'reg' (and, for Realtek,
+ 'realtek,i2c-protocol'), while the UART attachment carries the serial
+ peripheral properties such as 'current-speed'.
+
+properties:
+ compatible:
+ enum:
+ - realtek,pse-mcu-rtk
+ - realtek,pse-mcu-brcm
+
+ reg:
+ maxItems: 1
+
+ power-supply:
+ description: Regulator supplying the PoE power rail.
+
+ enable-gpios:
+ maxItems: 1
+
+ realtek,i2c-protocol:
+ $ref: /schemas/types.yaml#/definitions/string
+ enum: [ i2c, smbus ]
+ description: |
+ Wire framing the MCU firmware expects on the I2C bus. "smbus" means
+ reads carry a leading command byte (0x00) and a repeated start; "i2c"
+ means bare 12-byte writes and reads with no command prefix. Only
+ applies to the Realtek I2C attachment.
+
+required:
+ - compatible
+
+allOf:
+ - $ref: pse-controller.yaml#
+ - $ref: /schemas/serial/serial-peripheral-props.yaml#
+ # The I2C attachment (identified by 'reg') cannot carry serial bus props.
+ - if:
+ required: [reg]
+ then:
+ properties:
+ current-speed: false
+ max-speed: false
+ # 'realtek,i2c-protocol' is meaningful only for the Realtek I2C attachment;
+ # the Broadcom variant and any UART attachment must not carry it.
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: realtek,pse-mcu-rtk
+ required: [reg]
+ then:
+ required:
+ - realtek,i2c-protocol
+ else:
+ properties:
+ "realtek,i2c-protocol": false
+
+unevaluatedProperties: false
+
+examples:
+ # Realtek PSE chip, I2C attachment (SMBus framing).
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ethernet-pse@20 {
+ compatible = "realtek,pse-mcu-rtk";
+ reg = <0x20>;
+ realtek,i2c-protocol = "smbus";
+
+ pse-pis {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pse-pi@0 {
+ reg = <0>;
+ #pse-cells = <0>;
+ };
+ };
+ };
+ };
+
+ # Broadcom PSE chip, I2C attachment.
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ethernet-pse@20 {
+ compatible = "realtek,pse-mcu-brcm";
+ reg = <0x20>;
+
+ pse-pis {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pse-pi@0 {
+ reg = <0>;
+ #pse-cells = <0>;
+ };
+ };
+ };
+ };
+
+ # Realtek PSE chip, UART attachment.
+ - |
+ serial {
+ ethernet-pse {
+ compatible = "realtek,pse-mcu-rtk";
+ current-speed = <115200>;
+
+ pse-pis {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pse-pi@0 {
+ reg = <0>;
+ #pse-cells = <0>;
+ };
+ };
+ };
+ };
--
2.51.0
* [PATCH net-next v4 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Jonas Jelonek @ 2026-06-30 10:56 UTC
To: Oleksij Rempel, Kory Maincent, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
Krzysztof Kozlowski, Conor Dooley
Cc: netdev, devicetree, linux-kernel, Daniel Golle, Bjørn Mork,
Jonas Jelonek
In-Reply-To: <20260630105651.756058-1-jelonek.jonas@gmail.com>
A range of PoE switches use a small microcontroller on the PCB to front
the actual PSE silicon. The host CPU talks to that MCU over I2C/SMBus or
UART using a fixed 12-byte request/response protocol with a trailing
checksum; the PSE chips are managed by the MCU and are not accessed
directly. The same protocol family is spoken by Realtek and Broadcom PSE
MCUs, diverging in opcode numbering and a few response layouts, which the
driver abstracts behind a per-dialect opcode table and parser hooks
selected by the compatible. The specific PSE chip behind the MCU is
detected at runtime and only influences per-chip constants (power scaling
and the per-port cap).
The driver is split into a shared core and two transport modules:
- PSE_REALTEK_MCU: protocol, message framing, dialect machinery, and the
pse_controller_ops glue.
- PSE_REALTEK_MCU_I2C / PSE_REALTEK_MCU_UART: transport modules
registering the MCU on an I2C bus or a serdev port respectively.
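The split between the shared core and the transport modules can be sketched in userspace as a small ops structure. This is an assumption-laden illustration rather than the driver's actual interface: the mock_* names, the callback signatures, and the loopback transport are all invented for the example. Only the idea that transports supply send/recv callbacks to the core comes from the driver's own description.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FRAME_LEN	12

/* Callbacks a transport (I2C, UART, ...) would hand to the core. */
struct mock_transport_ops {
	int (*send)(void *ctx, const uint8_t *req, size_t len);
	int (*recv)(void *ctx, uint8_t *resp, size_t len);
};

struct mock_core {
	const struct mock_transport_ops *ops;
	void *ctx;	/* transport-private state */
};

/* One request/response transaction, independent of the transport. */
int mock_core_xfer(struct mock_core *core,
		   const uint8_t req[MOCK_FRAME_LEN],
		   uint8_t resp[MOCK_FRAME_LEN])
{
	int ret = core->ops->send(core->ctx, req, MOCK_FRAME_LEN);

	if (ret)
		return ret;
	return core->ops->recv(core->ctx, resp, MOCK_FRAME_LEN);
}

/* Hypothetical loopback transport: echoes the last request back. */
struct mock_loopback {
	uint8_t last[MOCK_FRAME_LEN];
};

static int mock_loop_send(void *ctx, const uint8_t *req, size_t len)
{
	struct mock_loopback *lb = ctx;

	if (len > sizeof(lb->last))
		return -1;
	memcpy(lb->last, req, len);
	return 0;
}

static int mock_loop_recv(void *ctx, uint8_t *resp, size_t len)
{
	struct mock_loopback *lb = ctx;

	if (len > sizeof(lb->last))
		return -1;
	memcpy(resp, lb->last, len);
	return 0;
}

const struct mock_transport_ops mock_loopback_ops = {
	.send = mock_loop_send,
	.recv = mock_loop_recv,
};
```

Keeping the core unaware of the bus is what lets one compatible per dialect bind on both transports.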
The realtek-pse-mcu-* files and PSE_REALTEK_MCU* symbols match the
realtek,pse-mcu-rtk / realtek,pse-mcu-brcm compatibles: all name the
Realtek PSE-MCU front-end, not the MCU silicon or the PSE chip behind
it (see the binding for the prefix rationale). Broadcom PSE MCUs speak
the same protocol family and are handled by the same shared core
through the dialect abstraction selected by the '-brcm' compatible.
Power budgeting is left to the MCU firmware; the driver advertises
PSE_BUDGET_EVAL_STRAT_DYNAMIC (controller-managed budget) accordingly.
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
---
MAINTAINERS | 7 +
drivers/net/pse-pd/Kconfig | 28 +
drivers/net/pse-pd/Makefile | 3 +
drivers/net/pse-pd/realtek-pse-mcu-core.c | 1019 +++++++++++++++++++++
drivers/net/pse-pd/realtek-pse-mcu-i2c.c | 163 ++++
drivers/net/pse-pd/realtek-pse-mcu-uart.c | 156 ++++
drivers/net/pse-pd/realtek-pse-mcu.h | 87 ++
7 files changed, 1463 insertions(+)
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-core.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-i2c.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu-uart.c
create mode 100644 drivers/net/pse-pd/realtek-pse-mcu.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 15011f5752a9..926c42d3bb93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22721,6 +22721,13 @@ S: Maintained
F: include/sound/rt*.h
F: sound/soc/codecs/rt*
+REALTEK/BROADCOM PSE MCU DRIVER
+M: Jonas Jelonek <jelonek.jonas@gmail.com>
+L: netdev@vger.kernel.org
+S: Maintained
+F: Documentation/devicetree/bindings/net/pse-pd/realtek,pse-mcu.yaml
+F: drivers/net/pse-pd/realtek-pse-mcu*
+
REALTEK OTTO WATCHDOG
M: Sander Vanheule <sander@svanheule.net>
L: linux-watchdog@vger.kernel.org
diff --git a/drivers/net/pse-pd/Kconfig b/drivers/net/pse-pd/Kconfig
index 7ef29657ee5d..23e44dde3dbf 100644
--- a/drivers/net/pse-pd/Kconfig
+++ b/drivers/net/pse-pd/Kconfig
@@ -13,6 +13,34 @@ menuconfig PSE_CONTROLLER
if PSE_CONTROLLER
+config PSE_REALTEK_MCU
+ tristate
+ help
+ Shared core for the Realtek/Broadcom PSE MCU driver. This is
+ selected automatically by the transport options below.
+
+config PSE_REALTEK_MCU_I2C
+ tristate "Realtek/Broadcom PSE MCU driver (I2C transport)"
+ depends on I2C
+ select PSE_REALTEK_MCU
+ help
+ Driver for the microcontroller (MCU) that fronts the PSE
+ hardware on switches with Realtek or Broadcom PSE chips, attached
+ via I2C/SMBus. The MCU exposes a message-based protocol; the actual
+ PSE silicon is not accessed directly. To compile this driver as a
+ module, choose M here: the module will be called realtek-pse-mcu-i2c.
+
+config PSE_REALTEK_MCU_UART
+ tristate "Realtek/Broadcom PSE MCU driver (UART transport)"
+ depends on SERIAL_DEV_BUS
+ select PSE_REALTEK_MCU
+ help
+ Driver for the microcontroller (MCU) that fronts the PSE
+ hardware on switches with Realtek or Broadcom PSE chips, attached
+ via UART. The MCU exposes a message-based protocol; the actual PSE
+ silicon is not accessed directly. To compile this driver as a
+ module, choose M here: the module will be called realtek-pse-mcu-uart.
+
config PSE_REGULATOR
tristate "Regulator based PSE controller"
help
diff --git a/drivers/net/pse-pd/Makefile b/drivers/net/pse-pd/Makefile
index cc78f7ea7f5f..9cca5900fe34 100644
--- a/drivers/net/pse-pd/Makefile
+++ b/drivers/net/pse-pd/Makefile
@@ -3,6 +3,9 @@
obj-$(CONFIG_PSE_CONTROLLER) += pse_core.o
+obj-$(CONFIG_PSE_REALTEK_MCU) += realtek-pse-mcu-core.o
+obj-$(CONFIG_PSE_REALTEK_MCU_I2C) += realtek-pse-mcu-i2c.o
+obj-$(CONFIG_PSE_REALTEK_MCU_UART) += realtek-pse-mcu-uart.o
obj-$(CONFIG_PSE_REGULATOR) += pse_regulator.o
obj-$(CONFIG_PSE_PD692X0) += pd692x0.o
obj-$(CONFIG_PSE_SI3474) += si3474.o
diff --git a/drivers/net/pse-pd/realtek-pse-mcu-core.c b/drivers/net/pse-pd/realtek-pse-mcu-core.c
new file mode 100644
index 000000000000..11a0abece37b
--- /dev/null
+++ b/drivers/net/pse-pd/realtek-pse-mcu-core.c
@@ -0,0 +1,1019 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Driver for the microcontroller (MCU) fronting Realtek or Broadcom PSE
+ * chips. Both vendors' MCUs speak a closely related 12-byte fixed-frame
+ * management protocol; this driver covers both via a per-dialect opcode
+ * table and response parsers.
+ *
+ * Many PoE switch designs put a dedicated microcontroller in front of the
+ * actual PSE silicon: the host CPU talks to the MCU over I2C/SMBus or
+ * UART, and the MCU in turn manages the PSE chips on the board. The MCU
+ * speaks a small message-based protocol (12-byte fixed-size frames; opcode
+ * + arg + 9 payload bytes + checksum). The PSE chips themselves are not
+ * accessed directly; everything goes through MCU commands.
+ *
+ * This driver targets that architecture for the Realtek-family protocol.
+ * Two dialects are supported: Realtek MCUs managing RTL823x/RTL8239* PSE
+ * chips, and Broadcom MCUs managing BCM591xx PSE chips. The two share
+ * frame format and a sum-mod-256 checksum but diverge on opcode numbers
+ * and on a few response layouts; this is handled by the per-dialect
+ * opcode table and parser hooks.
+ *
+ * Out of scope: PSE chips that are interfaced directly from the host
+ * without a management MCU, MCU designs that speak an unrelated protocol
+ * family, and "dumb PSE" modes where no host control is wired up at all.
+ * Those, if and when they show up in the kernel, belong in separate
+ * drivers under drivers/net/pse-pd/.
+ *
+ * This core module implements the protocol, decoding/encoding of MCU
+ * responses, and the pse_controller_ops integration. Transport modules
+ * (realtek-pse-mcu-i2c, realtek-pse-mcu-uart) provide the send/recv callbacks.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/cleanup.h>
+#include <linux/container_of.h>
+#include <linux/delay.h>
+#include <linux/gpio/consumer.h>
+#include <linux/jiffies.h>
+#include <linux/minmax.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/property.h>
+#include <linux/pse-pd/pse.h>
+#include <linux/unaligned.h>
+
+#include "realtek-pse-mcu.h"
+
+#define RTPSE_MCU_DEVICE_ID_RTL8238B 0x0138
+#define RTPSE_MCU_DEVICE_ID_RTL8239 0x0039
+#define RTPSE_MCU_DEVICE_ID_RTL8239C 0x0139
+#define RTPSE_MCU_DEVICE_ID_BCM59111 0xe111
+#define RTPSE_MCU_DEVICE_ID_BCM59121 0xe121
+
+#define RTPSE_MCU_PORT_STS_DISABLED 0x00
+#define RTPSE_MCU_PORT_STS_SEARCHING 0x01
+#define RTPSE_MCU_PORT_STS_DELIVERING 0x02
+#define RTPSE_MCU_PORT_STS_FAULT 0x04
+#define RTPSE_MCU_PORT_STS_REQUESTING 0x06
+
+/* RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_TYPE values */
+#define RTPSE_MCU_PORT_PW_LIMIT_TYPE_USER 0x02
+
+#define RTPSE_MCU_MAX_PORTS 48
+#define RTPSE_MCU_PORT_MAX_PRIORITY 3
+
+enum rtpse_mcu_cmd {
+ RTPSE_MCU_CMD_SET_GLOBAL_STATE,
+ RTPSE_MCU_CMD_GET_SYSTEM_INFO,
+ RTPSE_MCU_CMD_GET_EXT_CONFIG,
+
+ RTPSE_MCU_CMD_PORT_ENABLE,
+ RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_TYPE,
+ RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT,
+ RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_EXT,
+ RTPSE_MCU_CMD_PORT_SET_PRIORITY,
+ RTPSE_MCU_CMD_PORT_GET_STATUS,
+ RTPSE_MCU_CMD_PORT_GET_POWER_STATS,
+ RTPSE_MCU_CMD_PORT_GET_CONFIG,
+ RTPSE_MCU_CMD_PORT_GET_EXT_CONFIG,
+
+ RTPSE_MCU_NUM_CMDS,
+};
+
+struct rtpse_mcu_opcode {
+ u8 op;
+ bool valid;
+};
+
+/* Shorthand for the designated-initializer entries in dialect opcode tables. */
+#define RTPSE_MCU_OP(opc) { .op = (opc), .valid = true }
+
+/* Forward-declared so dialects can supply response parsers (defined below). */
+struct rtpse_mcu_info;
+struct rtpse_mcu_port_status;
+
+struct rtpse_mcu_dialect {
+ struct rtpse_mcu_opcode opcode[RTPSE_MCU_NUM_CMDS];
+
+ /*
+ * Response parsers. Each dialect must supply its own; the core calls
+ * these unconditionally rather than carrying a default that would
+ * silently mis-decode bytes from a dialect that forgot to set them.
+ */
+ int (*parse_system_info)(const u8 *payload, struct rtpse_mcu_info *info);
+ int (*parse_port_class)(const struct rtpse_mcu_port_status *status);
+ const char *(*mcu_type_str)(unsigned int mcu_type);
+};
+
+/*
+ * Per-compatible match data: selected by the DT/I2C compatible, it bundles
+ * the protocol dialect with attachment quirks that the exact MCU silicon
+ * does not determine (only its firmware protocol and the host bus do).
+ */
+struct rtpse_mcu_match_data {
+ const struct rtpse_mcu_dialect *dialect;
+ /* I2C framing must come from DT (realtek,i2c-protocol); else SMBus. */
+ bool i2c_proto_dt_required;
+};
+
+struct rtpse_mcu_chip_info {
+ const char *name;
+ u16 device_id;
+ u32 max_mW_per_port;
+ enum rtpse_mcu_cmd pw_set_cmd; /* command used by set_pw_limit */
+ u32 pw_set_lsb_mW; /* LSB of pw_set_cmd value, in mW */
+ u32 pw_read_lsb_mW; /* LSB of ext_config.max_power read-back, in mW */
+};
+
+/* Parsed MCU response structures (decoded from rtpse_mcu_msg replies) */
+
+struct rtpse_mcu_info {
+ u8 max_ports;
+ bool system_enable;
+ u16 device_id;
+ u8 sw_ver;
+ u8 mcu_type;
+ u8 config_status;
+ u8 ext_ver;
+};
+
+struct rtpse_mcu_ext_config {
+ u8 uvlo;
+ u8 ovlo;
+ bool prealloc_enable;
+ u8 num_of_pses;
+};
+
+struct rtpse_mcu_port_status {
+ u8 sts1;
+ u8 sts2;
+ u8 sts3;
+};
+
+struct rtpse_mcu_port_measurement {
+ u16 voltage_raw; /* 64.45mV/LSB */
+ u16 current_raw; /* 1mA/LSB */
+ u16 temperature_raw; /* T(mC) = 1250 * (220 - raw) */
+ u16 power_raw; /* 100mW/LSB */
+};
+
+struct rtpse_mcu_port_config {
+ bool enable;
+ u8 function_mode;
+ u8 detection_type;
+ u8 cls_type;
+ u8 disconnect_type;
+ u8 pair_type;
+};
+
+struct rtpse_mcu_port_ext_config {
+ u8 inrush_mode;
+ u8 limit_type;
+ u8 max_power;
+ u8 priority;
+ u8 chip_addr;
+ u8 channel;
+};
+
+static const struct rtpse_mcu_chip_info rtl8238b_info = {
+ .device_id = RTPSE_MCU_DEVICE_ID_RTL8238B,
+ .max_mW_per_port = 30000,
+ .name = "RTL8238B",
+ .pw_read_lsb_mW = 200,
+ .pw_set_cmd = RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT,
+ .pw_set_lsb_mW = 200,
+};
+
+static const struct rtpse_mcu_chip_info rtl8239_info = {
+ .device_id = RTPSE_MCU_DEVICE_ID_RTL8239,
+ .max_mW_per_port = 90000,
+ .name = "RTL8239",
+ .pw_read_lsb_mW = 400,
+ .pw_set_cmd = RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_EXT,
+ .pw_set_lsb_mW = 400,
+};
+
+static const struct rtpse_mcu_chip_info rtl8239c_info = {
+ .device_id = RTPSE_MCU_DEVICE_ID_RTL8239C,
+ .max_mW_per_port = 90000,
+ .name = "RTL8239C",
+ .pw_read_lsb_mW = 400,
+ .pw_set_cmd = RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_EXT,
+ .pw_set_lsb_mW = 400,
+};
+
+static const struct rtpse_mcu_chip_info bcm59111_info = {
+ .device_id = RTPSE_MCU_DEVICE_ID_BCM59111,
+ .max_mW_per_port = 30000,
+ .name = "BCM59111",
+ .pw_read_lsb_mW = 200,
+ .pw_set_cmd = RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT,
+ .pw_set_lsb_mW = 200,
+};
+
+static const struct rtpse_mcu_chip_info bcm59121_info = {
+ .device_id = RTPSE_MCU_DEVICE_ID_BCM59121,
+ /*
+ * BCM59121 is a 60W Type-3 part, but known boards run it at 802.3at
+ * and the BCM dialect has only the 8-bit/0.2W set command (<=51W);
+ * cap at the 30W the hardware actually offers.
+ */
+ .max_mW_per_port = 30000,
+ .name = "BCM59121",
+ .pw_read_lsb_mW = 200,
+ .pw_set_cmd = RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT,
+ .pw_set_lsb_mW = 200,
+};
+
+/* Helpers and basic functions */
+
+static struct rtpse_mcu_ctrl *to_rtpse_mcu_ctrl(struct pse_controller_dev *pcdev)
+{
+ return container_of(pcdev, struct rtpse_mcu_ctrl, pcdev);
+}
+
+bool rtpse_mcu_needs_i2c_proto(const struct rtpse_mcu_match_data *match)
+{
+ return match->i2c_proto_dt_required;
+}
+EXPORT_SYMBOL_GPL(rtpse_mcu_needs_i2c_proto);
+
+static void rtpse_mcu_msg_init(struct rtpse_mcu_msg *msg, u8 opcode)
+{
+ memset(msg, 0xff, sizeof(*msg));
+ msg->opcode = opcode;
+}
+
+static u8 rtpse_mcu_checksum(const u8 *buf, size_t len)
+{
+ u8 sum = 0;
+
+ while (len--)
+ sum += *buf++;
+ return sum;
+}
+
+static int rtpse_mcu_do_xfer(struct rtpse_mcu_ctrl *pse, struct rtpse_mcu_msg *req,
+ struct rtpse_mcu_msg *resp)
+{
+ int ret;
+
+ req->checksum = rtpse_mcu_checksum((u8 *)req, RTPSE_MCU_MSG_SIZE - 1);
+
+ scoped_guard(mutex, &pse->mutex) {
+ ret = pse->transport->send(pse, req);
+ if (ret)
+ return ret;
+
+ /*
+ * The MCU needs a fixed amount of time between receiving a request
+ * and having the response ready, regardless of how the bytes get to
+ * us. Pace the transaction here so each transport can keep its recv
+ * path simple: a single bounded wait rather than a generic retry.
+ */
+ msleep(RTPSE_MCU_RESPONSE_MS);
+
+ memset(resp, 0, sizeof(*resp));
+ ret = pse->transport->recv(pse, req, resp);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Explicit MCU error opcodes (observed on the BCM dialect; harmless
+ * to check for RTL too). Catch these before the generic
+ * opcode/checksum mismatch path so callers see a meaningful errno.
+ */
+ switch (resp->opcode) {
+ case RTPSE_MCU_OPCODE_INCOMPLETE:
+ return -EBADE;
+ case RTPSE_MCU_OPCODE_BAD_CSUM:
+ return -EBADMSG;
+ case RTPSE_MCU_OPCODE_NOT_READY:
+ return -EAGAIN;
+ }
+
+ if (resp->opcode != req->opcode ||
+ resp->seq_num != req->seq_num ||
+ resp->checksum != rtpse_mcu_checksum((u8 *)resp, RTPSE_MCU_MSG_SIZE - 1))
+ return -EBADMSG;
+
+ return 0;
+}
+
+static int rtpse_mcu_port_query(struct rtpse_mcu_ctrl *pse, unsigned int port, u8 opcode,
+ struct rtpse_mcu_msg *resp)
+{
+ struct rtpse_mcu_msg req;
+ int ret;
+
+ rtpse_mcu_msg_init(&req, opcode);
+ req.payload[0] = port;
+
+ ret = rtpse_mcu_do_xfer(pse, &req, resp);
+ if (ret)
+ return ret;
+
+ if (resp->payload[0] != port)
+ return -EIO;
+
+ return 0;
+}
+
+static int rtpse_mcu_port_cmd(struct rtpse_mcu_ctrl *pse, unsigned int port, u8 opcode, u8 arg)
+{
+ struct rtpse_mcu_msg req, resp;
+ int ret;
+
+ rtpse_mcu_msg_init(&req, opcode);
+ req.payload[0] = port;
+ req.payload[1] = arg;
+
+ ret = rtpse_mcu_do_xfer(pse, &req, &resp);
+ if (ret)
+ return ret;
+
+ if (resp.payload[0] != port || resp.payload[1] != 0)
+ return -EIO;
+
+ return 0;
+}
+
+/* Global operations */
+
+static int rtpse_mcu_get_info(struct rtpse_mcu_ctrl *pse, struct rtpse_mcu_info *info)
+{
+ struct rtpse_mcu_msg req, resp;
+ const struct rtpse_mcu_opcode *opc;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_GET_SYSTEM_INFO];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ rtpse_mcu_msg_init(&req, opc->op);
+ ret = rtpse_mcu_do_xfer(pse, &req, &resp);
+ if (ret)
+ return ret;
+
+ return pse->dialect->parse_system_info(resp.payload, info);
+}
+
+static int rtpse_mcu_get_ext_config(struct rtpse_mcu_ctrl *pse, struct rtpse_mcu_ext_config *config)
+{
+ struct rtpse_mcu_msg req, resp;
+ const struct rtpse_mcu_opcode *opc;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_GET_EXT_CONFIG];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ rtpse_mcu_msg_init(&req, opc->op);
+ ret = rtpse_mcu_do_xfer(pse, &req, &resp);
+ if (ret)
+ return ret;
+
+ config->uvlo = resp.payload[0];
+ config->ovlo = resp.payload[5];
+ config->prealloc_enable = (resp.payload[1] == 0x1);
+ config->num_of_pses = resp.payload[6];
+
+ return 0;
+}
+
+static int rtpse_mcu_set_global_state(struct rtpse_mcu_ctrl *pse, bool enable)
+{
+ struct rtpse_mcu_msg req, resp;
+ const struct rtpse_mcu_opcode *opc;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_SET_GLOBAL_STATE];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ rtpse_mcu_msg_init(&req, opc->op);
+ req.payload[0] = enable ? 0x1 : 0x0;
+
+ ret = rtpse_mcu_do_xfer(pse, &req, &resp);
+ if (ret)
+ return ret;
+
+ return (resp.payload[0] == 0x0) ? 0 : -EIO;
+}
+
+/* Port operations */
+
+static int rtpse_mcu_port_get_status(struct rtpse_mcu_ctrl *pse, unsigned int port,
+ struct rtpse_mcu_port_status *status)
+{
+ const struct rtpse_mcu_opcode *opc;
+ struct rtpse_mcu_msg resp;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_GET_STATUS];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ ret = rtpse_mcu_port_query(pse, port, opc->op, &resp);
+ if (ret)
+ return ret;
+
+ status->sts1 = resp.payload[1];
+ status->sts2 = resp.payload[2];
+ status->sts3 = resp.payload[3];
+
+ return 0;
+}
+
+static int rtpse_mcu_port_get_measurement(struct rtpse_mcu_ctrl *pse, unsigned int port,
+ struct rtpse_mcu_port_measurement *measurement)
+{
+ const struct rtpse_mcu_opcode *opc;
+ struct rtpse_mcu_msg resp;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_GET_POWER_STATS];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ ret = rtpse_mcu_port_query(pse, port, opc->op, &resp);
+ if (ret)
+ return ret;
+
+ measurement->voltage_raw = get_unaligned_be16(&resp.payload[1]);
+ measurement->current_raw = get_unaligned_be16(&resp.payload[3]);
+ measurement->temperature_raw = get_unaligned_be16(&resp.payload[5]);
+ measurement->power_raw = get_unaligned_be16(&resp.payload[7]);
+
+ return 0;
+}
+
+static int rtpse_mcu_port_get_config(struct rtpse_mcu_ctrl *pse, unsigned int port,
+ struct rtpse_mcu_port_config *config)
+{
+ const struct rtpse_mcu_opcode *opc;
+ struct rtpse_mcu_msg resp;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_GET_CONFIG];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ ret = rtpse_mcu_port_query(pse, port, opc->op, &resp);
+ if (ret)
+ return ret;
+
+ config->enable = (resp.payload[1] == 1);
+ config->function_mode = resp.payload[2];
+ config->detection_type = resp.payload[3];
+ config->cls_type = resp.payload[4];
+ config->disconnect_type = resp.payload[5];
+ config->pair_type = resp.payload[6];
+
+ return 0;
+}
+
+static int rtpse_mcu_port_get_ext_config(struct rtpse_mcu_ctrl *pse, unsigned int port,
+ struct rtpse_mcu_port_ext_config *config)
+{
+ const struct rtpse_mcu_opcode *opc;
+ struct rtpse_mcu_msg resp;
+ int ret;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_GET_EXT_CONFIG];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ ret = rtpse_mcu_port_query(pse, port, opc->op, &resp);
+ if (ret)
+ return ret;
+
+ config->inrush_mode = resp.payload[1];
+ config->limit_type = resp.payload[2];
+ config->max_power = resp.payload[3];
+ config->priority = resp.payload[4];
+ config->chip_addr = resp.payload[5];
+ config->channel = resp.payload[6];
+
+ return 0;
+}
+
+static int rtpse_mcu_port_set_state(struct rtpse_mcu_ctrl *pse, unsigned int port, bool enable)
+{
+ const struct rtpse_mcu_opcode *opc;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_ENABLE];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ return rtpse_mcu_port_cmd(pse, port, opc->op, enable ? 0x1 : 0x0);
+}
+
+/* PSE controller ops */
+
+static int rtpse_mcu_port_get_admin_state(struct pse_controller_dev *pcdev, int id,
+ struct pse_admin_state *admin_state)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_config config;
+ int ret;
+
+ ret = rtpse_mcu_port_get_config(pse, id, &config);
+ if (ret)
+ return ret;
+
+ admin_state->c33_admin_state = config.enable ? ETHTOOL_C33_PSE_ADMIN_STATE_ENABLED :
+ ETHTOOL_C33_PSE_ADMIN_STATE_DISABLED;
+ return 0;
+}
+
+static int rtpse_mcu_port_get_pw_status(struct pse_controller_dev *pcdev, int id,
+ struct pse_pw_status *pw_status)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_status status;
+ int ret;
+
+ ret = rtpse_mcu_port_get_status(pse, id, &status);
+ if (ret)
+ return ret;
+
+ switch (status.sts1) {
+ case RTPSE_MCU_PORT_STS_DISABLED:
+ pw_status->c33_pw_status = ETHTOOL_C33_PSE_PW_D_STATUS_DISABLED;
+ break;
+ case RTPSE_MCU_PORT_STS_SEARCHING:
+ case RTPSE_MCU_PORT_STS_REQUESTING:
+ pw_status->c33_pw_status = ETHTOOL_C33_PSE_PW_D_STATUS_SEARCHING;
+ break;
+ case RTPSE_MCU_PORT_STS_DELIVERING:
+ pw_status->c33_pw_status = ETHTOOL_C33_PSE_PW_D_STATUS_DELIVERING;
+ break;
+ case RTPSE_MCU_PORT_STS_FAULT:
+ pw_status->c33_pw_status = ETHTOOL_C33_PSE_PW_D_STATUS_FAULT;
+ break;
+ default:
+ pw_status->c33_pw_status = ETHTOOL_C33_PSE_PW_D_STATUS_UNKNOWN;
+ break;
+ }
+
+ return 0;
+}
+
+static int rtpse_mcu_port_get_pw_class(struct pse_controller_dev *pcdev, int id)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_status status;
+ int ret;
+
+ ret = rtpse_mcu_port_get_status(pse, id, &status);
+ if (ret)
+ return ret;
+
+ /*
+ * sts2 carries detection+classification only when sts1 is not a
+ * fault state; in fault states it encodes the fault type instead.
+ * Treat the two reserved sts1 codes (0x3, 0x5) as faults too, since
+ * the datasheet hints at "other fault" beyond the explicit 0x4.
+ */
+ switch (status.sts1) {
+ case RTPSE_MCU_PORT_STS_DISABLED:
+ case RTPSE_MCU_PORT_STS_SEARCHING:
+ case RTPSE_MCU_PORT_STS_DELIVERING:
+ case RTPSE_MCU_PORT_STS_REQUESTING:
+ return pse->dialect->parse_port_class(&status);
+ default:
+ return 0;
+ }
+}
+
+static int rtpse_mcu_port_get_actual_pw(struct pse_controller_dev *pcdev, int id)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_measurement measurement;
+ int ret;
+
+ ret = rtpse_mcu_port_get_measurement(pse, id, &measurement);
+ if (ret)
+ return ret;
+
+ /* 100mW per LSB */
+ return measurement.power_raw * 100U;
+}
+
+static int rtpse_mcu_port_get_voltage(struct pse_controller_dev *pcdev, int id)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_measurement measurement;
+ int ret;
+ u32 uV;
+
+ ret = rtpse_mcu_port_get_measurement(pse, id, &measurement);
+ if (ret)
+ return ret;
+
+ /* 64.45mV per LSB */
+ uV = (u32)measurement.voltage_raw * 64450U;
+ return min_t(u32, uV, INT_MAX);
+}
+
+static int rtpse_mcu_port_enable(struct pse_controller_dev *pcdev, int id)
+{
+ return rtpse_mcu_port_set_state(to_rtpse_mcu_ctrl(pcdev), id, true);
+}
+
+static int rtpse_mcu_port_disable(struct pse_controller_dev *pcdev, int id)
+{
+ return rtpse_mcu_port_set_state(to_rtpse_mcu_ctrl(pcdev), id, false);
+}
+
+static int rtpse_mcu_port_get_pw_limit(struct pse_controller_dev *pcdev, int id)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_ext_config config;
+ int ret;
+
+ ret = rtpse_mcu_port_get_ext_config(pse, id, &config);
+ if (ret)
+ return ret;
+
+ return config.max_power * pse->chip->pw_read_lsb_mW;
+}
+
+static int rtpse_mcu_port_set_pw_limit(struct pse_controller_dev *pcdev, int id, int max_mW)
+{
+ const struct rtpse_mcu_opcode *type_opc, *val_opc;
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ const struct rtpse_mcu_chip_info *chip = pse->chip;
+ unsigned int prg_val;
+ int ret;
+
+ if (max_mW < 0 || max_mW > chip->max_mW_per_port)
+ return -ERANGE;
+
+ type_opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_TYPE];
+ val_opc = &pse->dialect->opcode[chip->pw_set_cmd];
+ if (!type_opc->valid || !val_opc->valid)
+ return -EOPNOTSUPP;
+
+ /*
+ * Switch the port to user-defined limit mode first, then program the
+ * limit value. If the second cmd fails, the port is left in
+ * user-defined mode but with the previous limit value; the next
+ * successful set_pw_limit call recovers it.
+ */
+ ret = rtpse_mcu_port_cmd(pse, id, type_opc->op, RTPSE_MCU_PORT_PW_LIMIT_TYPE_USER);
+ if (ret)
+ return ret;
+
+ prg_val = min_t(unsigned int, max_mW / chip->pw_set_lsb_mW, 0xff);
+
+ return rtpse_mcu_port_cmd(pse, id, val_opc->op, prg_val);
+}
+
+static int rtpse_mcu_port_get_pw_limit_ranges(struct pse_controller_dev *pcdev, int id,
+ struct pse_pw_limit_ranges *out)
+{
+ struct ethtool_c33_pse_pw_limit_range *range;
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+
+ range = kzalloc_obj(*range, GFP_KERNEL);
+ if (!range)
+ return -ENOMEM;
+
+ range[0].min = 0;
+ range[0].max = pse->chip->max_mW_per_port;
+
+ out->c33_pw_limit_ranges = range;
+ return 1;
+}
+
+static int rtpse_mcu_port_get_prio(struct pse_controller_dev *pcdev, int id)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ struct rtpse_mcu_port_ext_config config;
+ int ret;
+
+ ret = rtpse_mcu_port_get_ext_config(pse, id, &config);
+ if (ret)
+ return ret;
+
+ return config.priority;
+}
+
+static int rtpse_mcu_port_set_prio(struct pse_controller_dev *pcdev, int id, unsigned int prio)
+{
+ struct rtpse_mcu_ctrl *pse = to_rtpse_mcu_ctrl(pcdev);
+ const struct rtpse_mcu_opcode *opc;
+
+ if (prio > RTPSE_MCU_PORT_MAX_PRIORITY)
+ return -ERANGE;
+
+ opc = &pse->dialect->opcode[RTPSE_MCU_CMD_PORT_SET_PRIORITY];
+ if (!opc->valid)
+ return -EOPNOTSUPP;
+
+ return rtpse_mcu_port_cmd(pse, id, opc->op, prio);
+}
+
+static const struct pse_controller_ops rtpse_mcu_ops = {
+ .pi_get_admin_state = rtpse_mcu_port_get_admin_state,
+ .pi_get_pw_status = rtpse_mcu_port_get_pw_status,
+ .pi_get_pw_class = rtpse_mcu_port_get_pw_class,
+ .pi_get_actual_pw = rtpse_mcu_port_get_actual_pw,
+ .pi_enable = rtpse_mcu_port_enable,
+ .pi_disable = rtpse_mcu_port_disable,
+ .pi_get_voltage = rtpse_mcu_port_get_voltage,
+ .pi_get_pw_limit = rtpse_mcu_port_get_pw_limit,
+ .pi_set_pw_limit = rtpse_mcu_port_set_pw_limit,
+ .pi_get_pw_limit_ranges = rtpse_mcu_port_get_pw_limit_ranges,
+ .pi_get_prio = rtpse_mcu_port_get_prio,
+ .pi_set_prio = rtpse_mcu_port_set_prio,
+};
+
+static int rtpse_mcu_discover(struct rtpse_mcu_ctrl *pse, struct rtpse_mcu_info *info)
+{
+ struct rtpse_mcu_ext_config ext_config;
+ unsigned long deadline;
+ int ret;
+
+ /*
+ * The MCU may not answer on the bus yet right after power-up or
+ * enable-gpios assertion: depending on the transport it either stays
+ * silent (-ETIMEDOUT) or does not ACK its address at all (-ENXIO /
+ * -EREMOTEIO). Retry within a bounded wall-time window so a slow boot
+ * still probes, while a genuinely unresponsive MCU fails with its real
+ * error instead of deferring forever and masking it.
+ */
+ deadline = jiffies + msecs_to_jiffies(RTPSE_MCU_BOOT_TIMEOUT_MS);
+ do {
+ ret = rtpse_mcu_get_info(pse, info);
+ if (ret != -ETIMEDOUT && ret != -ENXIO && ret != -EREMOTEIO &&
+ ret != -EAGAIN)
+ break;
+ msleep(RTPSE_MCU_BOOT_RETRY_MS);
+ } while (time_before(jiffies, deadline));
+ if (ret)
+ return dev_err_probe(pse->dev, ret, "failed to read MCU info\n");
+
+ switch (info->device_id) {
+ case RTPSE_MCU_DEVICE_ID_RTL8238B:
+ pse->chip = &rtl8238b_info;
+ break;
+ case RTPSE_MCU_DEVICE_ID_RTL8239:
+ pse->chip = &rtl8239_info;
+ break;
+ case RTPSE_MCU_DEVICE_ID_RTL8239C:
+ pse->chip = &rtl8239c_info;
+ break;
+ case RTPSE_MCU_DEVICE_ID_BCM59111:
+ pse->chip = &bcm59111_info;
+ break;
+ case RTPSE_MCU_DEVICE_ID_BCM59121:
+ pse->chip = &bcm59121_info;
+ break;
+ default:
+ return dev_err_probe(pse->dev, -EINVAL, "unknown PSE id 0x%x\n",
+ info->device_id);
+ }
+
+ if (!info->max_ports || info->max_ports > RTPSE_MCU_MAX_PORTS)
+ return dev_err_probe(pse->dev, -EINVAL,
+ "MCU reports invalid port count %u\n", info->max_ports);
+
+ ret = rtpse_mcu_get_ext_config(pse, &ext_config);
+ if (ret)
+ return dev_err_probe(pse->dev, ret, "failed to read MCU ext config\n");
+
+ dev_info(pse->dev, "%s MCU, %s (id 0x%04x), %u ports across %u PSE chip(s)\n",
+ pse->dialect->mcu_type_str(info->mcu_type), pse->chip->name,
+ info->device_id, info->max_ports, ext_config.num_of_pses);
+ return 0;
+}
+
+static void rtpse_mcu_regulator_disable(void *data)
+{
+ regulator_disable(data);
+}
+
+static void rtpse_mcu_global_disable(void *data)
+{
+ struct rtpse_mcu_ctrl *pse = data;
+
+ rtpse_mcu_set_global_state(pse, false);
+}
+
+int rtpse_mcu_register(struct rtpse_mcu_ctrl *pse)
+{
+ const struct rtpse_mcu_match_data *match;
+ struct gpio_desc *enable_gpio;
+ struct rtpse_mcu_info info;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct rtpse_mcu_msg) != RTPSE_MCU_MSG_SIZE);
+
+ ret = devm_mutex_init(pse->dev, &pse->mutex);
+ if (ret)
+ return ret;
+
+ match = device_get_match_data(pse->dev);
+ if (!match)
+ return dev_err_probe(pse->dev, -ENODEV, "missing match data\n");
+ pse->dialect = match->dialect;
+
+ /*
+ * Catch a dialect that forgot to set one of the required hooks at
+ * probe time, rather than NULL-deref'ing later from a fast path.
+ */
+ if (!pse->dialect ||
+ !pse->dialect->parse_system_info ||
+ !pse->dialect->parse_port_class ||
+ !pse->dialect->mcu_type_str)
+ return dev_err_probe(pse->dev, -EINVAL,
+ "dialect for chip is incomplete\n");
+
+ pse->poe_supply = devm_regulator_get(pse->dev, "power");
+ if (IS_ERR(pse->poe_supply))
+ return dev_err_probe(pse->dev, PTR_ERR(pse->poe_supply),
+ "failed to get PoE supply\n");
+
+ enable_gpio = devm_gpiod_get_optional(pse->dev, "enable", GPIOD_OUT_HIGH);
+ if (IS_ERR(enable_gpio))
+ return dev_err_probe(pse->dev, PTR_ERR(enable_gpio),
+ "failed to get enable gpio\n");
+
+ ret = rtpse_mcu_discover(pse, &info);
+ if (ret)
+ return ret;
+
+ ret = regulator_enable(pse->poe_supply);
+ if (ret)
+ return dev_err_probe(pse->dev, ret, "failed to enable PoE supply\n");
+
+ ret = devm_add_action_or_reset(pse->dev, rtpse_mcu_regulator_disable, pse->poe_supply);
+ if (ret)
+ return ret;
+
+ if (!info.system_enable) {
+ ret = rtpse_mcu_set_global_state(pse, true);
+ /* Dialects without a global-state concept (e.g. BCM) return
+ * -EOPNOTSUPP; treat that as "no separate enable required".
+ */
+ if (ret && ret != -EOPNOTSUPP)
+ return dev_err_probe(pse->dev, ret,
+ "failed to enable PSE system\n");
+ if (!ret) {
+ ret = devm_add_action_or_reset(pse->dev,
+ rtpse_mcu_global_disable, pse);
+ if (ret)
+ return ret;
+ }
+ }
+
+ /*
+ * Depending on the MCU firmware configuration (which might be different
+ * for every board), it isn't known whether the PoE subsystem is active or
+ * inactive by default. At this stage, the PSE chips might already deliver
+ * power to PDs without any explicit enable.
+ */
+
+ /* pcdev.owner is set by the transport, so the registered controller
+ * pins the transport module that owns the live device, not the core.
+ */
+ pse->pcdev.ops = &rtpse_mcu_ops;
+ pse->pcdev.dev = pse->dev;
+ pse->pcdev.types = ETHTOOL_PSE_C33;
+ pse->pcdev.nr_lines = info.max_ports;
+ pse->pcdev.pis_prio_max = RTPSE_MCU_PORT_MAX_PRIORITY;
+ pse->pcdev.supp_budget_eval_strategies = PSE_BUDGET_EVAL_STRAT_DYNAMIC;
+
+ return devm_pse_controller_register(pse->dev, &pse->pcdev);
+}
+EXPORT_SYMBOL_GPL(rtpse_mcu_register);
+
+static int rtpse_mcu_rtl_parse_system_info(const u8 *payload, struct rtpse_mcu_info *info)
+{
+ info->max_ports = payload[1];
+ info->system_enable = (payload[2] == 0x1);
+ info->device_id = get_unaligned_be16(&payload[3]);
+ info->sw_ver = payload[5];
+ info->mcu_type = payload[6];
+ info->config_status = payload[7];
+ info->ext_ver = payload[8];
+ return 0;
+}
+
+static int rtpse_mcu_rtl_parse_port_class(const struct rtpse_mcu_port_status *status)
+{
+ /* Class lives in the upper nibble of sts2. */
+ return FIELD_GET(GENMASK(7, 4), status->sts2);
+}
+
+static const char *rtpse_mcu_rtl_mcu_type_str(unsigned int mcu_type)
+{
+ switch (mcu_type) {
+ case 0x00: return "GigaDevice GD32F310";
+ case 0x01: return "GigaDevice GD32F230";
+ case 0x02: return "GigaDevice GD32F303";
+ case 0x03: return "GigaDevice GD32F103";
+ case 0x04: return "GigaDevice GD32E103";
+ case 0x10: return "Nuvoton M0516";
+ case 0x11: return "Nuvoton M0564";
+ case 0x12: return "Nuvoton NUC029";
+ default: return "unknown";
+ }
+}
+
+static int rtpse_mcu_brcm_parse_system_info(const u8 *payload, struct rtpse_mcu_info *info)
+{
+ info->max_ports = payload[1];
+ /* BCM has no explicit system_enable byte; the closest analog is the
+ * "remote enable" bit in the system-status flags at payload[7].
+ */
+ info->system_enable = !!(payload[7] & BIT(2));
+ info->device_id = get_unaligned_be16(&payload[3]);
+ info->sw_ver = payload[5];
+ info->mcu_type = payload[6];
+ info->config_status = payload[7];
+ info->ext_ver = payload[8];
+ return 0;
+}
+
+static int rtpse_mcu_brcm_parse_port_class(const struct rtpse_mcu_port_status *status)
+{
+ /* BCM puts the detected class in payload[3] (== sts3) directly.
+ * Mask to the low nibble; class is 0..8 and any high bits would be
+ * noise.
+ */
+ return status->sts3 & 0x0f;
+}
+
+static const char *rtpse_mcu_brcm_mcu_type_str(unsigned int mcu_type)
+{
+ switch (mcu_type) {
+ case 0x00: return "ST Micro ST32F100";
+ case 0x01: return "Nuvoton M05xx LAN";
+ case 0x02: return "ST Micro STF030C8";
+ case 0x03: return "Nuvoton M058SAN";
+ case 0x04: return "Nuvoton NUC122";
+ default: return "unknown";
+ }
+}
+
+/* Map each logical command the core issues to its per-dialect opcode. */
+static const struct rtpse_mcu_dialect rtpse_mcu_dialect_rtk = {
+ .parse_system_info = rtpse_mcu_rtl_parse_system_info,
+ .parse_port_class = rtpse_mcu_rtl_parse_port_class,
+ .mcu_type_str = rtpse_mcu_rtl_mcu_type_str,
+ .opcode = {
+ [RTPSE_MCU_CMD_SET_GLOBAL_STATE] = RTPSE_MCU_OP(0x00),
+ [RTPSE_MCU_CMD_GET_SYSTEM_INFO] = RTPSE_MCU_OP(0x40),
+ [RTPSE_MCU_CMD_GET_EXT_CONFIG] = RTPSE_MCU_OP(0x4a),
+
+ [RTPSE_MCU_CMD_PORT_ENABLE] = RTPSE_MCU_OP(0x01),
+ [RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_TYPE] = RTPSE_MCU_OP(0x12),
+ [RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT] = RTPSE_MCU_OP(0x13),
+ [RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_EXT] = RTPSE_MCU_OP(0x14),
+ [RTPSE_MCU_CMD_PORT_SET_PRIORITY] = RTPSE_MCU_OP(0x15),
+ [RTPSE_MCU_CMD_PORT_GET_STATUS] = RTPSE_MCU_OP(0x42),
+ [RTPSE_MCU_CMD_PORT_GET_POWER_STATS] = RTPSE_MCU_OP(0x44),
+ [RTPSE_MCU_CMD_PORT_GET_CONFIG] = RTPSE_MCU_OP(0x48),
+ [RTPSE_MCU_CMD_PORT_GET_EXT_CONFIG] = RTPSE_MCU_OP(0x49),
+ },
+};
+
+static const struct rtpse_mcu_dialect rtpse_mcu_dialect_brcm = {
+ .parse_system_info = rtpse_mcu_brcm_parse_system_info,
+ .parse_port_class = rtpse_mcu_brcm_parse_port_class,
+ .mcu_type_str = rtpse_mcu_brcm_mcu_type_str,
+ .opcode = {
+ [RTPSE_MCU_CMD_GET_SYSTEM_INFO] = RTPSE_MCU_OP(0x20),
+ [RTPSE_MCU_CMD_GET_EXT_CONFIG] = RTPSE_MCU_OP(0x2b),
+
+ [RTPSE_MCU_CMD_PORT_ENABLE] = RTPSE_MCU_OP(0x00),
+ [RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT_TYPE] = RTPSE_MCU_OP(0x15),
+ [RTPSE_MCU_CMD_PORT_SET_POWER_LIMIT] = RTPSE_MCU_OP(0x16),
+ [RTPSE_MCU_CMD_PORT_SET_PRIORITY] = RTPSE_MCU_OP(0x1a),
+ [RTPSE_MCU_CMD_PORT_GET_STATUS] = RTPSE_MCU_OP(0x21),
+ [RTPSE_MCU_CMD_PORT_GET_POWER_STATS] = RTPSE_MCU_OP(0x30),
+ [RTPSE_MCU_CMD_PORT_GET_CONFIG] = RTPSE_MCU_OP(0x25),
+ [RTPSE_MCU_CMD_PORT_GET_EXT_CONFIG] = RTPSE_MCU_OP(0x26),
+ },
+};
+
+const struct rtpse_mcu_match_data rtpse_mcu_rtk_data = {
+ .dialect = &rtpse_mcu_dialect_rtk,
+ .i2c_proto_dt_required = true,
+};
+EXPORT_SYMBOL_GPL(rtpse_mcu_rtk_data);
+
+const struct rtpse_mcu_match_data rtpse_mcu_brcm_data = {
+ .dialect = &rtpse_mcu_dialect_brcm,
+};
+EXPORT_SYMBOL_GPL(rtpse_mcu_brcm_data);
+
+MODULE_DESCRIPTION("Realtek/Broadcom PSE MCU driver (core)");
+MODULE_AUTHOR("Jonas Jelonek <jelonek.jonas@gmail.com>");
+MODULE_LICENSE("GPL");
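
For illustration only (not part of the patch): a minimal userspace sketch of
the frame handling described in the core file's header comment, i.e. a 12-byte
frame whose last byte is the sum of the first 11 bytes modulo 256. The names
struct frame, frame_checksum(), frame_build() and frame_valid() are
hypothetical, and the field layout is an assumption inferred from how the
driver uses opcode, seq_num, payload and checksum; the real definition lives
in realtek-pse-mcu.h, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 12 /* fixed frame length described in the driver comment */

/* Hypothetical userspace mirror of the frame; field names are assumptions. */
struct frame {
	uint8_t opcode;
	uint8_t seq_num;
	uint8_t payload[9];
	uint8_t checksum;
};

/* Sum of all bytes modulo 256; the uint8_t accumulator wraps on overflow. */
static uint8_t frame_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	while (len--)
		sum += *buf++;
	return sum;
}

/* Fill unused bytes with 0xff, set the opcode, then seal with the checksum. */
static void frame_build(struct frame *f, uint8_t opcode)
{
	memset(f, 0xff, sizeof(*f));
	f->opcode = opcode;
	f->checksum = frame_checksum((const uint8_t *)f, FRAME_SIZE - 1);
}

/* A frame is accepted only if the last byte matches the sum of the first 11. */
static int frame_valid(const struct frame *f)
{
	return f->checksum == frame_checksum((const uint8_t *)f, FRAME_SIZE - 1);
}
```

With opcode 0x40 and every other byte left at 0xff, the 11 summed bytes add up
to 2614, so the checksum byte comes out as 0x36. Flipping any single bit in
the frame afterwards makes frame_valid() fail.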
diff --git a/drivers/net/pse-pd/realtek-pse-mcu-i2c.c b/drivers/net/pse-pd/realtek-pse-mcu-i2c.c
new file mode 100644
index 000000000000..6e6e3645c509
--- /dev/null
+++ b/drivers/net/pse-pd/realtek-pse-mcu-i2c.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/delay.h>
+#include <linux/i2c.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/pse-pd/pse.h>
+
+#include "realtek-pse-mcu.h"
+
+/*
+ * The core has already waited RTPSE_MCU_RESPONSE_MS before calling us, so
+ * the response is normally ready on the very first read. For commands the
+ * MCU answers more slowly, keep polling at the typical response cadence
+ * up to the worst-case ceiling.
+ */
+#define RTPSE_MCU_I2C_RETRY_MS RTPSE_MCU_RESPONSE_MS
+#define RTPSE_MCU_I2C_MAX_TRIES (RTPSE_MCU_RESPONSE_MAX_MS / RTPSE_MCU_I2C_RETRY_MS)
+
+static int rtpse_mcu_i2c_smbus_send(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req)
+{
+ struct i2c_client *client = to_i2c_client(pse->dev);
+
+ /* Send opcode as SMBus command byte; remaining 11 bytes as block data */
+ return i2c_smbus_write_i2c_block_data(client, req->opcode, RTPSE_MCU_MSG_SIZE - 1,
+ (const u8 *)req + 1);
+}
+
+static int rtpse_mcu_i2c_smbus_recv(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req,
+ struct rtpse_mcu_msg *resp)
+{
+ struct i2c_client *client = to_i2c_client(pse->dev);
+ int tries, ret;
+
+ for (tries = 0; tries < RTPSE_MCU_I2C_MAX_TRIES; tries++) {
+ if (tries > 0)
+ msleep(RTPSE_MCU_I2C_RETRY_MS);
+
+ /* MCU needs 0x00 as command byte for read */
+ ret = i2c_smbus_read_i2c_block_data(client, 0x00,
+ RTPSE_MCU_MSG_SIZE,
+ (u8 *)resp);
+ if (ret < 0)
+ return ret;
+ if (ret == RTPSE_MCU_MSG_SIZE && rtpse_mcu_resp_is_final(req, resp))
+ return 0;
+ }
+
+ return -ETIMEDOUT;
+}
+
+static const struct rtpse_mcu_transport_ops rtpse_mcu_i2c_smbus_ops = {
+ .send = rtpse_mcu_i2c_smbus_send,
+ .recv = rtpse_mcu_i2c_smbus_recv,
+};
+
+static int rtpse_mcu_i2c_native_send(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req)
+{
+ struct i2c_client *client = to_i2c_client(pse->dev);
+ int ret;
+
+ ret = i2c_master_send(client, (const u8 *)req, RTPSE_MCU_MSG_SIZE);
+ if (ret < 0)
+ return ret;
+ return ret == RTPSE_MCU_MSG_SIZE ? 0 : -EIO;
+}
+
+static int rtpse_mcu_i2c_native_recv(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req,
+ struct rtpse_mcu_msg *resp)
+{
+ struct i2c_client *client = to_i2c_client(pse->dev);
+ int tries, ret;
+
+ for (tries = 0; tries < RTPSE_MCU_I2C_MAX_TRIES; tries++) {
+ if (tries > 0)
+ msleep(RTPSE_MCU_I2C_RETRY_MS);
+
+ ret = i2c_master_recv(client, (u8 *)resp, RTPSE_MCU_MSG_SIZE);
+ if (ret < 0)
+ return ret;
+ if (ret == RTPSE_MCU_MSG_SIZE && rtpse_mcu_resp_is_final(req, resp))
+ return 0;
+ }
+
+ return -ETIMEDOUT;
+}
+
+static const struct rtpse_mcu_transport_ops rtpse_mcu_i2c_native_ops = {
+ .send = rtpse_mcu_i2c_native_send,
+ .recv = rtpse_mcu_i2c_native_recv,
+};
+
+static int rtpse_mcu_i2c_probe(struct i2c_client *client)
+{
+ struct device *dev = &client->dev;
+ const struct rtpse_mcu_match_data *match;
+ struct rtpse_mcu_ctrl *pse;
+ bool use_native = false;
+ int ret;
+
+ match = device_get_match_data(dev);
+ if (!match)
+ return dev_err_probe(dev, -ENODEV, "missing match data\n");
+
+ if (rtpse_mcu_needs_i2c_proto(match)) {
+ const char *proto;
+
+ ret = device_property_read_string(dev, "realtek,i2c-protocol", &proto);
+ if (ret)
+ return dev_err_probe(dev, ret,
+ "missing required \"realtek,i2c-protocol\" property\n");
+
+ if (!strcmp(proto, "i2c"))
+ use_native = true;
+ else if (!strcmp(proto, "smbus"))
+ use_native = false;
+ else
+ return dev_err_probe(dev, -EINVAL,
+ "unknown realtek,i2c-protocol \"%s\"\n", proto);
+ }
+
+ if (use_native) {
+ if (!i2c_check_functionality(client->adapter, I2C_FUNC_I2C))
+ return dev_err_probe(dev, -EOPNOTSUPP,
+ "plain-I2C MCU protocol requires I2C-capable adapter\n");
+ } else {
+ if (!i2c_check_functionality(client->adapter,
+ I2C_FUNC_SMBUS_WRITE_I2C_BLOCK |
+ I2C_FUNC_SMBUS_READ_I2C_BLOCK))
+ return dev_err_probe(dev, -EOPNOTSUPP,
+ "SMBus MCU protocol requires SMBus I2C-block support\n");
+ }
+
+ pse = devm_kzalloc(dev, sizeof(*pse), GFP_KERNEL);
+ if (!pse)
+ return -ENOMEM;
+
+ pse->dev = dev;
+ pse->pcdev.owner = THIS_MODULE;
+ pse->transport = use_native ? &rtpse_mcu_i2c_native_ops : &rtpse_mcu_i2c_smbus_ops;
+
+ return rtpse_mcu_register(pse);
+}
+
+static const struct of_device_id rtpse_mcu_i2c_of_match[] = {
+ { .compatible = "realtek,pse-mcu-rtk", .data = &rtpse_mcu_rtk_data },
+ { .compatible = "realtek,pse-mcu-brcm", .data = &rtpse_mcu_brcm_data },
+ { /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, rtpse_mcu_i2c_of_match);
+
+static struct i2c_driver rtpse_mcu_i2c_driver = {
+ .driver = {
+ .name = "realtek-pse-mcu-i2c",
+ .of_match_table = rtpse_mcu_i2c_of_match,
+ },
+ .probe = rtpse_mcu_i2c_probe,
+};
+module_i2c_driver(rtpse_mcu_i2c_driver);
+
+MODULE_AUTHOR("Jonas Jelonek <jelonek.jonas@gmail.com>");
+MODULE_DESCRIPTION("Realtek/Broadcom PSE MCU driver (I2C transport)");
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/pse-pd/realtek-pse-mcu-uart.c b/drivers/net/pse-pd/realtek-pse-mcu-uart.c
new file mode 100644
index 000000000000..ef04e0d92963
--- /dev/null
+++ b/drivers/net/pse-pd/realtek-pse-mcu-uart.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/cleanup.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/pse-pd/pse.h>
+#include <linux/serdev.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+
+#include "realtek-pse-mcu.h"
+
+#define RTPSE_MCU_UART_BAUD_DEFAULT 19200
+#define RTPSE_MCU_UART_TX_TIMEOUT msecs_to_jiffies(100)
+#define RTPSE_MCU_UART_RX_TIMEOUT msecs_to_jiffies(RTPSE_MCU_RESPONSE_MAX_MS)
+
+struct rtpse_mcu_uart {
+ struct rtpse_mcu_ctrl pse;
+ struct serdev_device *serdev;
+ struct completion rx_done;
+ spinlock_t rx_lock; /* protects rx_buf and rx_len */
+ size_t rx_len;
+ u8 rx_buf[RTPSE_MCU_MSG_SIZE];
+};
+
+#define to_rtpse_mcu_uart(p) container_of(p, struct rtpse_mcu_uart, pse)
+
+/*
+ * No framing is done here: a glitched frame costs one transaction, then
+ * the next _send re-frames from rx_len 0. Resync works by returning count
+ * (not take), dropping any overflow so serdev keeps no leftover to bleed
+ * into the next frame.
+ */
+static size_t rtpse_mcu_uart_receive(struct serdev_device *serdev,
+ const u8 *buf, size_t count)
+{
+ struct rtpse_mcu_uart *ctx = serdev_device_get_drvdata(serdev);
+ size_t take;
+
+ scoped_guard(spinlock_irqsave, &ctx->rx_lock) {
+ take = min(count, sizeof(ctx->rx_buf) - ctx->rx_len);
+ if (take) {
+ memcpy(ctx->rx_buf + ctx->rx_len, buf, take);
+ ctx->rx_len += take;
+ if (ctx->rx_len == sizeof(ctx->rx_buf))
+ complete(&ctx->rx_done);
+ }
+ }
+
+ /* consume all to avoid desync/misalignment */
+ return count;
+}
+
+static const struct serdev_device_ops rtpse_mcu_uart_serdev_ops = {
+ .receive_buf = rtpse_mcu_uart_receive,
+ .write_wakeup = serdev_device_write_wakeup,
+};
+
+static int rtpse_mcu_uart_send(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req)
+{
+ struct rtpse_mcu_uart *ctx = to_rtpse_mcu_uart(pse);
+ int written;
+
+ /* clear any leftover rx state before transmitting */
+ scoped_guard(spinlock_irqsave, &ctx->rx_lock) {
+ reinit_completion(&ctx->rx_done);
+ ctx->rx_len = 0;
+ }
+
+ written = serdev_device_write(ctx->serdev, (const u8 *)req, sizeof(*req),
+ RTPSE_MCU_UART_TX_TIMEOUT);
+ if (written < 0)
+ return written;
+ if (written != sizeof(*req))
+ return -EIO;
+
+ return 0;
+}
+
+static int rtpse_mcu_uart_recv(struct rtpse_mcu_ctrl *pse,
+ const struct rtpse_mcu_msg *req,
+ struct rtpse_mcu_msg *resp)
+{
+ struct rtpse_mcu_uart *ctx = to_rtpse_mcu_uart(pse);
+
+ if (!wait_for_completion_timeout(&ctx->rx_done, RTPSE_MCU_UART_RX_TIMEOUT))
+ return -ETIMEDOUT;
+
+ scoped_guard(spinlock_irqsave, &ctx->rx_lock) {
+ if (ctx->rx_len != sizeof(*resp))
+ return -EIO;
+
+ memcpy(resp, ctx->rx_buf, sizeof(*resp));
+ }
+ return 0;
+}
+
+static const struct rtpse_mcu_transport_ops rtpse_mcu_uart_transport_ops = {
+ .send = rtpse_mcu_uart_send,
+ .recv = rtpse_mcu_uart_recv,
+};
+
+static int rtpse_mcu_uart_probe(struct serdev_device *serdev)
+{
+ u32 speed = RTPSE_MCU_UART_BAUD_DEFAULT;
+ struct device *dev = &serdev->dev;
+ struct rtpse_mcu_uart *ctx;
+ int ret;
+
+ ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->serdev = serdev;
+ ctx->pse.dev = dev;
+ ctx->pse.pcdev.owner = THIS_MODULE;
+ ctx->pse.transport = &rtpse_mcu_uart_transport_ops;
+ init_completion(&ctx->rx_done);
+ spin_lock_init(&ctx->rx_lock);
+
+ serdev_device_set_drvdata(serdev, ctx);
+ serdev_device_set_client_ops(serdev, &rtpse_mcu_uart_serdev_ops);
+
+ ret = devm_serdev_device_open(dev, serdev);
+ if (ret)
+ return dev_err_probe(dev, ret, "failed to open serdev\n");
+
+ fwnode_property_read_u32(dev_fwnode(dev), "current-speed", &speed);
+ serdev_device_set_baudrate(serdev, speed);
+ serdev_device_set_flow_control(serdev, false);
+ serdev_device_set_parity(serdev, SERDEV_PARITY_NONE);
+
+ return rtpse_mcu_register(&ctx->pse);
+}
+
+static const struct of_device_id rtpse_mcu_uart_of_match[] = {
+ { .compatible = "realtek,pse-mcu-rtk", .data = &rtpse_mcu_rtk_data },
+ { .compatible = "realtek,pse-mcu-brcm", .data = &rtpse_mcu_brcm_data },
+ { /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, rtpse_mcu_uart_of_match);
+
+static struct serdev_device_driver rtpse_mcu_uart_driver = {
+ .driver = {
+ .name = "realtek-pse-mcu-uart",
+ .of_match_table = rtpse_mcu_uart_of_match,
+ },
+ .probe = rtpse_mcu_uart_probe,
+};
+module_serdev_device_driver(rtpse_mcu_uart_driver);
+
+MODULE_AUTHOR("Jonas Jelonek <jelonek.jonas@gmail.com>");
+MODULE_DESCRIPTION("Realtek/Broadcom PSE MCU driver (UART transport)");
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/pse-pd/realtek-pse-mcu.h b/drivers/net/pse-pd/realtek-pse-mcu.h
new file mode 100644
index 000000000000..b9bf3b2dde08
--- /dev/null
+++ b/drivers/net/pse-pd/realtek-pse-mcu.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _REALTEK_PSE_MCU_H
+#define _REALTEK_PSE_MCU_H
+
+#include <linux/mutex.h>
+#include <linux/pse-pd/pse.h>
+#include <linux/types.h>
+
+/*
+ * Time the MCU itself needs between accepting a request and having a
+ * response ready. These are properties of the MCU firmware, not of the
+ * underlying transport: the core paces transactions by RTPSE_MCU_RESPONSE_MS
+ * and both transports size their per-transaction recv ceiling from
+ * RTPSE_MCU_RESPONSE_MAX_MS, since some commands are documented as
+ * needing up to ~1s to produce a reply.
+ */
+#define RTPSE_MCU_RESPONSE_MS 25
+#define RTPSE_MCU_RESPONSE_MAX_MS 1000
+
+/*
+ * Total time to keep retrying the first MCU read at probe, and the pause
+ * between attempts. Right after enable-gpios is asserted the MCU may not
+ * answer on the bus yet; give it a bounded window to come up before
+ * declaring the probe failed.
+ */
+#define RTPSE_MCU_BOOT_TIMEOUT_MS 3000
+#define RTPSE_MCU_BOOT_RETRY_MS 100
+
+#define RTPSE_MCU_MSG_SIZE 12
+
+struct rtpse_mcu_msg {
+ u8 opcode;
+ u8 seq_num;
+ u8 payload[9];
+ u8 checksum;
+} __packed;
+
+/*
+ * MCU status opcodes (seen on the BCM dialect; RTL never emits them).
+ * INCOMPLETE/BAD_CSUM are terminal; NOT_READY is transient.
+ */
+#define RTPSE_MCU_OPCODE_INCOMPLETE 0xfd /* -EBADE */
+#define RTPSE_MCU_OPCODE_BAD_CSUM 0xfe /* -EBADMSG */
+#define RTPSE_MCU_OPCODE_NOT_READY 0xff /* -EAGAIN */
+
+/* A polling transport can stop here: the matching reply, or a terminal error. */
+static inline bool rtpse_mcu_resp_is_final(const struct rtpse_mcu_msg *req,
+ const struct rtpse_mcu_msg *resp)
+{
+ return resp->opcode == req->opcode ||
+ resp->opcode == RTPSE_MCU_OPCODE_INCOMPLETE ||
+ resp->opcode == RTPSE_MCU_OPCODE_BAD_CSUM;
+}
+
+/* Opaque to transports; defined in realtek-pse-core.c. */
+struct rtpse_mcu_dialect;
+struct rtpse_mcu_match_data;
+struct rtpse_mcu_chip_info;
+struct rtpse_mcu_ctrl;
+
+struct rtpse_mcu_transport_ops {
+ int (*send)(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req);
+ int (*recv)(struct rtpse_mcu_ctrl *pse, const struct rtpse_mcu_msg *req,
+ struct rtpse_mcu_msg *resp);
+};
+
+struct rtpse_mcu_ctrl {
+ struct device *dev;
+ struct pse_controller_dev pcdev;
+ struct mutex mutex; /* serializes MCU request/response transactions */
+ const struct rtpse_mcu_dialect *dialect;
+ const struct rtpse_mcu_chip_info *chip;
+ const struct rtpse_mcu_transport_ops *transport;
+
+ struct regulator *poe_supply;
+};
+
+int rtpse_mcu_register(struct rtpse_mcu_ctrl *pse);
+
+/* Whether the I2C transport must read "realtek,i2c-protocol" from DT. */
+bool rtpse_mcu_needs_i2c_proto(const struct rtpse_mcu_match_data *match);
+
+extern const struct rtpse_mcu_match_data rtpse_mcu_rtk_data;
+extern const struct rtpse_mcu_match_data rtpse_mcu_brcm_data;
+
+#endif
--
2.51.0
^ permalink raw reply related
* Re: [PATCH v3 0/2] ptp: split non-host-disciplined PHC drivers into a dedicated subdirectory
From: David Woodhouse @ 2026-06-30 10:57 UTC (permalink / raw)
To: Wen Gu, richardcochran, andrew+netdev, davem, edumazet, kuba,
pabeni, netdev, linux-kernel
Cc: svens, nick.shi, ajay.kaher, alexey.makhalov,
bcm-kernel-feedback-list, linux-s390, xuanzhuo, dust.li, mani,
imran.shaik
In-Reply-To: <20260630031519.23072-1-guwen@linux.alibaba.com>
[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]
On Tue, 2026-06-30 at 11:15 +0800, Wen Gu wrote:
> # Proposal
>
> This series makes the separation explicit by reorganizing the drivers/ptp/
> layout into the following groups:
>
> - drivers/ptp/ : PTP core infrastructure and host-disciplined
> PHC drivers.
>
> - drivers/ptp/emulated/ : non-host-disciplined PHC drivers that expose
> precision time from hypervisors, platforms,
> or firmware. These clocks are read-only and
> not adjusted by the host.
I was thinking we'd move them to drivers/phc, and simplify them as we do.
Most of them are just a lot of PTP driver boilerplate, wrapping around
one central function like
static int vmclock_get_crosststamp(struct vmclock_state *st,
struct ptp_system_timestamp *sts,
struct system_counterval_t *system_counter,
struct timespec64 *tspec)
...which is called with different permutations of arguments depending
on the actual PTP call.
I was thinking of reducing the duplication and having the PHC drivers
provide *only* that central function. Let the common PHC code provide
the interface to PTP (as well as to core timekeeping, for setting the
clock at boot, for timekeeping_set_reference() in the vmclock case, and
perhaps even for a PPS-like discipline from other clocks).
Here's a *very* hastily thrown together proof of concept; utterly
untested and AI-produced, and I've only given it the bare minimum of
oversight thus far (I have been meaning to do this for weeks but other
things have taken precedence so far)...
https://git.infradead.org/?p=linux-phc.git;a=shortlog;h=refs/heads/next
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* [PATCH net-next 0/8] drivers/net: replace __get_free_pages() with kmalloc()
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
This is a (small) part of a larger effort to replace page allocator calls
with kmalloc.
My initial intention a few months ago was to remove ugly casts [1], but then
willy pointed out that Linus objected to something like this [2], and it
looks like technical debt that is more than a decade old.
Largely, anything that doesn't need struct page (or a memdesc in the
future) should just use kmalloc() or kvmalloc() to allocate memory.
kmalloc() guarantees alignment, physical contiguity and working
virt_to_phys(), and besides a nicer API that returns void * on alloc and
doesn't need to know the allocation size on free, kmalloc() provides
better debugging capabilities than the page allocator.
Another thing is that touching these allocation sites gives reviewers an
opportunity to check whether a PAGE_SIZE buffer is actually needed or whether
another size would be more appropriate.
For larger allocations that don't need physically contiguous memory
kvmalloc() can be a better option than __get_free_pages() because under
memory pressure it is easier to allocate several order-0 pages than a
physically contiguous chunk with the same number of pages.
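To make the order arithmetic above concrete, here is a minimal userspace
sketch (not kernel code) of how a byte size maps to a page allocator order.
The names model_get_order() and model_pages_for_order() are hypothetical, and
the page size is passed as a parameter because it is an assumption here; the
sketch only mirrors the rounding that the kernel's get_order() performs.

```c
#include <stddef.h>

/*
 * Userspace model of the rounding done by the kernel's get_order():
 * return the smallest order such that (page_size << order) >= size.
 * Hypothetical helper for illustration; size must be > 0 and small
 * enough that the shifted span does not overflow size_t.
 */
static unsigned int model_get_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Number of physically contiguous pages an order-N request asks for. */
static size_t model_pages_for_order(unsigned int order)
{
	return (size_t)1 << order;
}
```

Assuming 4 KiB pages, a 16 KiB buffer such as the b43 debugfs one maps to
order 2, i.e. four physically contiguous pages, whereas kvmalloc() may fall
back to independently allocated order-0 pages when contiguity is not needed.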
And last, but not least, removing needless calls to the page allocator should
help with the memdesc (aka project folio) conversion. There will be far fewer
places to audit to see if the user was actually using struct page.
Also in git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git gfp-to-kmalloc/drivers-net
[1] https://lore.kernel.org/all/20251018093002.3660549-1-rppt@kernel.org/
[2] https://lore.kernel.org/all/CA+55aFwp4iy4rtX2gE2WjBGFL=NxMVnoFeHqYa2j1dYOMMGqxg@mail.gmail.com/
---
Mike Rapoport (Microsoft) (8):
b43, b43legacy: debugfs: use kmalloc() to allocate formatting buffers
bnx2x: use kzalloc() to allocate mac filtering list
ice: use kzalloc() to allocate staging buffer for reading from GNSS
libertas: debugfs: use kzalloc() to allocate formatting buffers
mwifiex: debugfs: use kzalloc() to allocate formatting buffers
sfc/siena: use kmalloc() to allocate logging buffer
sfc: use kmalloc() to allocate logging buffer
wlcore: allocate aggregation and firmware log buffers with kzalloc()
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 6 +--
drivers/net/ethernet/intel/ice/ice_gnss.c | 5 +-
drivers/net/ethernet/sfc/mcdi.c | 7 +--
drivers/net/ethernet/sfc/siena/mcdi.c | 7 +--
drivers/net/wireless/broadcom/b43/debugfs.c | 12 ++---
drivers/net/wireless/broadcom/b43legacy/debugfs.c | 11 ++--
drivers/net/wireless/marvell/libertas/debugfs.c | 39 ++++++--------
drivers/net/wireless/marvell/mwifiex/debugfs.c | 62 ++++++++++-------------
drivers/net/wireless/ti/wlcore/main.c | 14 +++--
9 files changed, 73 insertions(+), 90 deletions(-)
---
base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
change-id: 20260616-b4-drivers-net-fc7f5b7e0a4c
Best regards,
--
Sincerely yours,
Mike.
^ permalink raw reply
* [PATCH net-next 1/8] b43, b43legacy: debugfs: use kmalloc() to allocate formatting buffers
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
b43* debugfs functions allocate 16 KiB buffers for formatting debug output
text using __get_free_pages().
kmalloc() provides a better API that does not require ugly casts, and
kfree() does not need to know the size of the freed object. For a 16 KiB
allocation kmalloc() will delegate to the buddy allocator anyway.
Replace use of __get_free_pages() with kmalloc().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/wireless/broadcom/b43/debugfs.c | 12 +++++-------
drivers/net/wireless/broadcom/b43legacy/debugfs.c | 11 +++++------
2 files changed, 10 insertions(+), 13 deletions(-)
diff --git a/drivers/net/wireless/broadcom/b43/debugfs.c b/drivers/net/wireless/broadcom/b43/debugfs.c
index acddae68947a..31a1ff00c1a4 100644
--- a/drivers/net/wireless/broadcom/b43/debugfs.c
+++ b/drivers/net/wireless/broadcom/b43/debugfs.c
@@ -495,7 +495,6 @@ static ssize_t b43_debugfs_read(struct file *file, char __user *userbuf,
ssize_t ret;
char *buf;
const size_t bufsize = 1024 * 16; /* 16 kiB buffer */
- const size_t buforder = get_order(bufsize);
int err = 0;
if (!count)
@@ -518,15 +517,14 @@ static ssize_t b43_debugfs_read(struct file *file, char __user *userbuf,
dfile = fops_to_dfs_file(dev, dfops);
if (!dfile->buffer) {
- buf = (char *)__get_free_pages(GFP_KERNEL, buforder);
+ buf = kzalloc(bufsize, GFP_KERNEL);
if (!buf) {
err = -ENOMEM;
goto out_unlock;
}
- memset(buf, 0, bufsize);
ret = dfops->read(dev, buf, bufsize);
if (ret <= 0) {
- free_pages((unsigned long)buf, buforder);
+ kfree(buf);
err = ret;
goto out_unlock;
}
@@ -538,7 +536,7 @@ static ssize_t b43_debugfs_read(struct file *file, char __user *userbuf,
dfile->buffer,
dfile->data_len);
if (*ppos >= dfile->data_len) {
- free_pages((unsigned long)dfile->buffer, buforder);
+ kfree(dfile->buffer);
dfile->buffer = NULL;
dfile->data_len = 0;
}
@@ -577,7 +575,7 @@ static ssize_t b43_debugfs_write(struct file *file,
goto out_unlock;
}
- buf = (char *)get_zeroed_page(GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf) {
err = -ENOMEM;
goto out_unlock;
@@ -591,7 +589,7 @@ static ssize_t b43_debugfs_write(struct file *file,
goto out_freepage;
out_freepage:
- free_page((unsigned long)buf);
+ kfree(buf);
out_unlock:
mutex_unlock(&dev->wl->mutex);
diff --git a/drivers/net/wireless/broadcom/b43legacy/debugfs.c b/drivers/net/wireless/broadcom/b43legacy/debugfs.c
index 3ad99124d522..42cce5e0402d 100644
--- a/drivers/net/wireless/broadcom/b43legacy/debugfs.c
+++ b/drivers/net/wireless/broadcom/b43legacy/debugfs.c
@@ -192,7 +192,6 @@ static ssize_t b43legacy_debugfs_read(struct file *file, char __user *userbuf,
ssize_t ret;
char *buf;
const size_t bufsize = 1024 * 16; /* 16 KiB buffer */
- const size_t buforder = get_order(bufsize);
int err = 0;
if (!count)
@@ -215,7 +214,7 @@ static ssize_t b43legacy_debugfs_read(struct file *file, char __user *userbuf,
dfile = fops_to_dfs_file(dev, dfops);
if (!dfile->buffer) {
- buf = (char *)__get_free_pages(GFP_KERNEL, buforder);
+ buf = kmalloc(bufsize, GFP_KERNEL);
if (!buf) {
err = -ENOMEM;
goto out_unlock;
@@ -228,7 +227,7 @@ static ssize_t b43legacy_debugfs_read(struct file *file, char __user *userbuf,
} else
ret = dfops->read(dev, buf, bufsize);
if (ret <= 0) {
- free_pages((unsigned long)buf, buforder);
+ kfree(buf);
err = ret;
goto out_unlock;
}
@@ -240,7 +239,7 @@ static ssize_t b43legacy_debugfs_read(struct file *file, char __user *userbuf,
dfile->buffer,
dfile->data_len);
if (*ppos >= dfile->data_len) {
- free_pages((unsigned long)dfile->buffer, buforder);
+ kfree(dfile->buffer);
dfile->buffer = NULL;
dfile->data_len = 0;
}
@@ -279,7 +278,7 @@ static ssize_t b43legacy_debugfs_write(struct file *file,
goto out_unlock;
}
- buf = (char *)get_zeroed_page(GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf) {
err = -ENOMEM;
goto out_unlock;
@@ -298,7 +297,7 @@ static ssize_t b43legacy_debugfs_write(struct file *file,
goto out_freepage;
out_freepage:
- free_page((unsigned long)buf);
+ kfree(buf);
out_unlock:
mutex_unlock(&dev->wl->mutex);
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 2/8] bnx2x: use kzalloc() to allocate mac filtering list
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
bnx2x_mcast_enqueue_cmd() allocates memory for mac filtering list using
__get_free_pages().
This memory can be allocated with kzalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index 07a908a2c72f..d560524d317d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -26,6 +26,7 @@
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/crc32c.h>
+#include <linux/slab.h>
#include "bnx2x.h"
#include "bnx2x_cmn.h"
#include "bnx2x_sp.h"
@@ -2664,7 +2665,7 @@ static void bnx2x_free_groups(struct list_head *mcast_group_list)
struct bnx2x_mcast_elem_group,
mcast_group_link);
 list_del(&current_mcast_group->mcast_group_link);
- free_page((unsigned long)current_mcast_group);
+ kfree(current_mcast_group);
}
}
@@ -2713,8 +2714,7 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
total_elems = BNX2X_MCAST_BINS_NUM;
}
while (total_elems > 0) {
- elem_group = (struct bnx2x_mcast_elem_group *)
- __get_free_page(GFP_ATOMIC | __GFP_ZERO);
+ elem_group = kzalloc(PAGE_SIZE, GFP_ATOMIC);
if (!elem_group) {
bnx2x_free_groups(&new_cmd->group_head);
kfree(new_cmd);
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 3/8] ice: use kzalloc() to allocate staging buffer for reading from GNSS
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
ice_gnss_read() uses get_zeroed_page() to allocate a staging buffer for
reading GNSS module data via I2C bus.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/ethernet/intel/ice/ice_gnss.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_gnss.c b/drivers/net/ethernet/intel/ice/ice_gnss.c
index 8fd954f1ebd6..7d21c3417b0b 100644
--- a/drivers/net/ethernet/intel/ice/ice_gnss.c
+++ b/drivers/net/ethernet/intel/ice/ice_gnss.c
@@ -2,6 +2,7 @@
/* Copyright (C) 2021-2022, Intel Corporation. */
#include "ice.h"
+#include <linux/slab.h>
#include "ice_lib.h"
/**
@@ -124,7 +125,7 @@ static void ice_gnss_read(struct kthread_work *work)
data_len = min_t(typeof(data_len), data_len, PAGE_SIZE);
- buf = (char *)get_zeroed_page(GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf) {
err = -ENOMEM;
goto requeue;
@@ -151,7 +152,7 @@ static void ice_gnss_read(struct kthread_work *work)
count, i);
delay = ICE_GNSS_TIMER_DELAY_TIME;
free_buf:
- free_page((unsigned long)buf);
+ kfree(buf);
requeue:
kthread_queue_delayed_work(gnss->kworker, &gnss->read_work, delay);
if (err)
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 4/8] libertas: debugfs: use kzalloc() to allocate formatting buffers
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
libertas debugfs functions allocate buffers for formatting debug
output text using get_zeroed_page().
These buffers can be allocated with kmalloc() as there's nothing special
about them to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/wireless/marvell/libertas/debugfs.c | 39 ++++++++++---------------
1 file changed, 16 insertions(+), 23 deletions(-)
diff --git a/drivers/net/wireless/marvell/libertas/debugfs.c b/drivers/net/wireless/marvell/libertas/debugfs.c
index 9ebd69134940..9428f954837a 100644
--- a/drivers/net/wireless/marvell/libertas/debugfs.c
+++ b/drivers/net/wireless/marvell/libertas/debugfs.c
@@ -35,8 +35,7 @@ static ssize_t lbs_dev_info(struct file *file, char __user *userbuf,
{
struct lbs_private *priv = file->private_data;
size_t pos = 0;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
ssize_t res;
if (!buf)
return -ENOMEM;
@@ -48,7 +47,7 @@ static ssize_t lbs_dev_info(struct file *file, char __user *userbuf,
res = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
- free_page(addr);
+ kfree(buf);
return res;
}
@@ -96,8 +95,7 @@ static ssize_t lbs_sleepparams_read(struct file *file, char __user *userbuf,
ssize_t ret;
size_t pos = 0;
struct sleep_params sp;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;
@@ -113,7 +111,7 @@ static ssize_t lbs_sleepparams_read(struct file *file, char __user *userbuf,
ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
out_unlock:
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -165,8 +163,7 @@ static ssize_t lbs_host_sleep_read(struct file *file, char __user *userbuf,
struct lbs_private *priv = file->private_data;
ssize_t ret;
size_t pos = 0;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;
@@ -174,7 +171,7 @@ static ssize_t lbs_host_sleep_read(struct file *file, char __user *userbuf,
ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -228,7 +225,7 @@ static ssize_t lbs_threshold_read(uint16_t tlv_type, uint16_t event_mask,
u8 freq;
int events = 0;
- buf = (char *)get_zeroed_page(GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;
@@ -261,7 +258,7 @@ static ssize_t lbs_threshold_read(uint16_t tlv_type, uint16_t event_mask,
kfree(subscribed);
out_page:
- free_page((unsigned long)buf);
+ kfree(buf);
return ret;
}
@@ -436,8 +433,7 @@ static ssize_t lbs_rdmac_read(struct file *file, char __user *userbuf,
struct lbs_private *priv = file->private_data;
ssize_t pos = 0;
int ret;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
u32 val = 0;
if (!buf)
@@ -450,7 +446,7 @@ static ssize_t lbs_rdmac_read(struct file *file, char __user *userbuf,
priv->mac_offset, val);
ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
}
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -506,8 +502,7 @@ static ssize_t lbs_rdbbp_read(struct file *file, char __user *userbuf,
struct lbs_private *priv = file->private_data;
ssize_t pos = 0;
int ret;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
u32 val;
if (!buf)
@@ -520,7 +515,7 @@ static ssize_t lbs_rdbbp_read(struct file *file, char __user *userbuf,
priv->bbp_offset, val);
ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
}
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -578,8 +573,7 @@ static ssize_t lbs_rdrf_read(struct file *file, char __user *userbuf,
struct lbs_private *priv = file->private_data;
ssize_t pos = 0;
int ret;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
u32 val;
if (!buf)
@@ -592,7 +586,7 @@ static ssize_t lbs_rdrf_read(struct file *file, char __user *userbuf,
priv->rf_offset, val);
ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
}
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -812,8 +806,7 @@ static ssize_t lbs_debugfs_read(struct file *file, char __user *userbuf,
char *p;
int i;
struct debug_data *d;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;
@@ -836,7 +829,7 @@ static ssize_t lbs_debugfs_read(struct file *file, char __user *userbuf,
res = simple_read_from_buffer(userbuf, count, ppos, p, pos);
- free_page(addr);
+ kfree(buf);
return res;
}
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 5/8] mwifiex: debugfs: use kzalloc() to allocate formatting buffers
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
mwifiex debugfs functions allocate buffers for formatting debug output
text using get_zeroed_page().
These buffers can be allocated with kmalloc() as there's nothing special
about them to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/wireless/marvell/mwifiex/debugfs.c | 62 +++++++++++---------------
1 file changed, 27 insertions(+), 35 deletions(-)
diff --git a/drivers/net/wireless/marvell/mwifiex/debugfs.c b/drivers/net/wireless/marvell/mwifiex/debugfs.c
index 9deaf59dcb62..573768b6ad91 100644
--- a/drivers/net/wireless/marvell/mwifiex/debugfs.c
+++ b/drivers/net/wireless/marvell/mwifiex/debugfs.c
@@ -6,6 +6,7 @@
*/
#include <linux/debugfs.h>
+#include <linux/slab.h>
#include "main.h"
#include "11n.h"
@@ -67,8 +68,8 @@ mwifiex_info_read(struct file *file, char __user *ubuf,
struct net_device *netdev = priv->netdev;
struct netdev_hw_addr *ha;
struct netdev_queue *txq;
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- char *p = (char *) page, fmt[64];
+ char *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ char *p = page, fmt[64];
struct mwifiex_bss_info info;
ssize_t ret;
int i = 0;
@@ -133,11 +134,10 @@ mwifiex_info_read(struct file *file, char __user *ubuf,
}
p += sprintf(p, "\n");
- ret = simple_read_from_buffer(ubuf, count, ppos, (char *) page,
- (unsigned long) p - page);
+ ret = simple_read_from_buffer(ubuf, count, ppos, page, p - page);
free_and_exit:
- free_page(page);
+ kfree(page);
return ret;
}
@@ -168,8 +168,8 @@ mwifiex_getlog_read(struct file *file, char __user *ubuf,
{
struct mwifiex_private *priv =
(struct mwifiex_private *) file->private_data;
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- char *p = (char *) page;
+ char *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ char *p = page;
ssize_t ret;
struct mwifiex_ds_get_stats stats;
@@ -220,11 +220,10 @@ mwifiex_getlog_read(struct file *file, char __user *ubuf,
stats.bcn_miss_cnt);
- ret = simple_read_from_buffer(ubuf, count, ppos, (char *) page,
- (unsigned long) p - page);
+ ret = simple_read_from_buffer(ubuf, count, ppos, page, p - page);
free_and_exit:
- free_page(page);
+ kfree(page);
return ret;
}
@@ -247,8 +246,8 @@ mwifiex_histogram_read(struct file *file, char __user *ubuf,
ssize_t ret;
struct mwifiex_histogram_data *phist_data;
int i, value;
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- char *p = (char *)page;
+ char *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ char *p = page;
if (!p)
return -ENOMEM;
@@ -309,11 +308,10 @@ mwifiex_histogram_read(struct file *file, char __user *ubuf,
i, value);
}
- ret = simple_read_from_buffer(ubuf, count, ppos, (char *)page,
- (unsigned long)p - page);
+ ret = simple_read_from_buffer(ubuf, count, ppos, page, p - page);
free_and_exit:
- free_page(page);
+ kfree(page);
return ret;
}
@@ -383,8 +381,8 @@ mwifiex_debug_read(struct file *file, char __user *ubuf,
{
struct mwifiex_private *priv =
(struct mwifiex_private *) file->private_data;
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- char *p = (char *) page;
+ char *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ char *p = page;
ssize_t ret;
if (!p)
@@ -396,11 +394,10 @@ mwifiex_debug_read(struct file *file, char __user *ubuf,
p += mwifiex_debug_info_to_buffer(priv, p, &info);
- ret = simple_read_from_buffer(ubuf, count, ppos, (char *) page,
- (unsigned long) p - page);
+ ret = simple_read_from_buffer(ubuf, count, ppos, page, p - page);
free_and_exit:
- free_page(page);
+ kfree(page);
return ret;
}
@@ -457,8 +454,7 @@ mwifiex_regrdwr_read(struct file *file, char __user *ubuf,
{
struct mwifiex_private *priv =
(struct mwifiex_private *) file->private_data;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *) addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
int pos = 0, ret = 0;
u32 reg_value;
@@ -497,7 +493,7 @@ mwifiex_regrdwr_read(struct file *file, char __user *ubuf,
ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
done:
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -511,8 +507,7 @@ mwifiex_debug_mask_read(struct file *file, char __user *ubuf,
{
struct mwifiex_private *priv =
(struct mwifiex_private *)file->private_data;
- unsigned long page = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)page;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
size_t ret = 0;
int pos = 0;
@@ -523,7 +518,7 @@ mwifiex_debug_mask_read(struct file *file, char __user *ubuf,
priv->adapter->debug_mask);
ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
- free_page(page);
+ kfree(buf);
return ret;
}
@@ -652,8 +647,7 @@ mwifiex_memrw_read(struct file *file, char __user *ubuf,
size_t count, loff_t *ppos)
{
struct mwifiex_private *priv = (void *)file->private_data;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
int ret, pos = 0;
if (!buf)
@@ -663,7 +657,7 @@ mwifiex_memrw_read(struct file *file, char __user *ubuf,
priv->mem_rw.value);
ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -719,8 +713,7 @@ mwifiex_rdeeprom_read(struct file *file, char __user *ubuf,
{
struct mwifiex_private *priv =
(struct mwifiex_private *) file->private_data;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *) addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
int pos, ret, i;
u8 value[MAX_EEPROM_DATA];
@@ -749,7 +742,7 @@ mwifiex_rdeeprom_read(struct file *file, char __user *ubuf,
done:
ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
out_free:
- free_page(addr);
+ kfree(buf);
return ret;
}
@@ -820,8 +813,7 @@ mwifiex_hscfg_read(struct file *file, char __user *ubuf,
size_t count, loff_t *ppos)
{
struct mwifiex_private *priv = (void *)file->private_data;
- unsigned long addr = get_zeroed_page(GFP_KERNEL);
- char *buf = (char *)addr;
+ char *buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
int pos, ret;
struct mwifiex_ds_hs_cfg hscfg;
@@ -836,7 +828,7 @@ mwifiex_hscfg_read(struct file *file, char __user *ubuf,
ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
- free_page(addr);
+ kfree(buf);
return ret;
}
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 6/8] sfc/siena: use kmalloc() to allocate logging buffer
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
efx_siena_mcdi_init() allocates a logging buffer for MCDI firmware
communication diagnostics.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/ethernet/sfc/siena/mcdi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/sfc/siena/mcdi.c b/drivers/net/ethernet/sfc/siena/mcdi.c
index 4d0d6bd5d3d1..048c1e6017c0 100644
--- a/drivers/net/ethernet/sfc/siena/mcdi.c
+++ b/drivers/net/ethernet/sfc/siena/mcdi.c
@@ -7,6 +7,7 @@
#include <linux/delay.h>
#include <linux/moduleparam.h>
#include <linux/atomic.h>
+#include <linux/slab.h>
#include "net_driver.h"
#include "nic.h"
#include "io.h"
@@ -73,7 +74,7 @@ int efx_siena_mcdi_init(struct efx_nic *efx)
mcdi->efx = efx;
#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
/* consuming code assumes buffer is page-sized */
- mcdi->logging_buffer = (char *)__get_free_page(GFP_KERNEL);
+ mcdi->logging_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!mcdi->logging_buffer)
goto fail1;
mcdi->logging_enabled = efx_siena_mcdi_logging_default;
@@ -116,7 +117,7 @@ int efx_siena_mcdi_init(struct efx_nic *efx)
return 0;
fail2:
#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
- free_page((unsigned long)mcdi->logging_buffer);
+ kfree(mcdi->logging_buffer);
fail1:
#endif
kfree(efx->mcdi);
@@ -142,7 +143,7 @@ void efx_siena_mcdi_fini(struct efx_nic *efx)
return;
#ifdef CONFIG_SFC_SIENA_MCDI_LOGGING
- free_page((unsigned long)efx->mcdi->iface.logging_buffer);
+ kfree(efx->mcdi->iface.logging_buffer);
#endif
kfree(efx->mcdi);
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 7/8] sfc: use kmalloc() to allocate logging buffer
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
efx_mcdi_init() allocates a logging buffer for MCDI firmware
communication diagnostics.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/ethernet/sfc/mcdi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c
index e65db9b70724..b806d3d90c42 100644
--- a/drivers/net/ethernet/sfc/mcdi.c
+++ b/drivers/net/ethernet/sfc/mcdi.c
@@ -7,6 +7,7 @@
#include <linux/delay.h>
#include <linux/moduleparam.h>
#include <linux/atomic.h>
+#include <linux/slab.h>
#include "net_driver.h"
#include "nic.h"
#include "io.h"
@@ -71,7 +72,7 @@ int efx_mcdi_init(struct efx_nic *efx)
mcdi->efx = efx;
#ifdef CONFIG_SFC_MCDI_LOGGING
/* consuming code assumes buffer is page-sized */
- mcdi->logging_buffer = (char *)__get_free_page(GFP_KERNEL);
+ mcdi->logging_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!mcdi->logging_buffer)
goto fail1;
mcdi->logging_enabled = mcdi_logging_default;
@@ -112,7 +113,7 @@ int efx_mcdi_init(struct efx_nic *efx)
return 0;
fail2:
#ifdef CONFIG_SFC_MCDI_LOGGING
- free_page((unsigned long)mcdi->logging_buffer);
+ kfree(mcdi->logging_buffer);
fail1:
#endif
kfree(efx->mcdi);
@@ -138,7 +139,7 @@ void efx_mcdi_fini(struct efx_nic *efx)
return;
#ifdef CONFIG_SFC_MCDI_LOGGING
- free_page((unsigned long)efx->mcdi->iface.logging_buffer);
+ kfree(efx->mcdi->iface.logging_buffer);
#endif
kfree(efx->mcdi);
--
2.53.0
^ permalink raw reply related
* [PATCH net-next 8/8] wlcore: allocate aggregation and firmware log buffers with kzalloc()
From: Mike Rapoport (Microsoft) @ 2026-06-30 10:59 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Brian Norris, Edward Cree, Francesco Dolcini, Manish Chopra,
Mike Rapoport, Przemek Kitszel, Sudarsana Kalluru, Tony Nguyen,
b43-dev, intel-wired-lan, libertas-dev, linux-kernel, linux-mm,
linux-net-drivers, linux-wireless, netdev
In-Reply-To: <20260630-b4-drivers-net-v1-0-672162a91f37@kernel.org>
wlcore_alloc_hw() uses __get_free_pages() to allocate TX aggregation
and firmware log buffers used for software data staging.
These buffers can be allocated with kmalloc() as there's nothing special
about them to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_pages() with kmalloc(), get_zeroed_page() with
kzalloc(), and free_pages()/free_page() with kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
drivers/net/wireless/ti/wlcore/main.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
index be583ae331c0..5595f7a1fc0c 100644
--- a/drivers/net/wireless/ti/wlcore/main.c
+++ b/drivers/net/wireless/ti/wlcore/main.c
@@ -6354,7 +6354,6 @@ struct ieee80211_hw *wlcore_alloc_hw(size_t priv_size, u32 aggr_buf_size,
struct ieee80211_hw *hw;
struct wl1271 *wl;
int i, j, ret;
- unsigned int order;
hw = ieee80211_alloc_hw(sizeof(*wl), &wl1271_ops);
if (!hw) {
@@ -6434,8 +6433,7 @@ struct ieee80211_hw *wlcore_alloc_hw(size_t priv_size, u32 aggr_buf_size,
mutex_init(&wl->flush_mutex);
init_completion(&wl->nvs_loading_complete);
- order = get_order(aggr_buf_size);
- wl->aggr_buf = (u8 *)__get_free_pages(GFP_KERNEL, order);
+ wl->aggr_buf = kmalloc(round_up(aggr_buf_size, PAGE_SIZE), GFP_KERNEL);
if (!wl->aggr_buf) {
ret = -ENOMEM;
goto err_wq;
@@ -6449,7 +6447,7 @@ struct ieee80211_hw *wlcore_alloc_hw(size_t priv_size, u32 aggr_buf_size,
}
/* Allocate one page for the FW log */
- wl->fwlog = (u8 *)get_zeroed_page(GFP_KERNEL);
+ wl->fwlog = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!wl->fwlog) {
ret = -ENOMEM;
goto err_dummy_packet;
@@ -6474,13 +6472,13 @@ struct ieee80211_hw *wlcore_alloc_hw(size_t priv_size, u32 aggr_buf_size,
kfree(wl->mbox);
err_fwlog:
- free_page((unsigned long)wl->fwlog);
+ kfree(wl->fwlog);
err_dummy_packet:
dev_kfree_skb(wl->dummy_packet);
err_aggr:
- free_pages((unsigned long)wl->aggr_buf, order);
+ kfree(wl->aggr_buf);
err_wq:
destroy_workqueue(wl->freezable_wq);
@@ -6509,9 +6507,9 @@ int wlcore_free_hw(struct wl1271 *wl)
kfree(wl->buffer_32);
kfree(wl->mbox);
- free_page((unsigned long)wl->fwlog);
+ kfree(wl->fwlog);
dev_kfree_skb(wl->dummy_packet);
- free_pages((unsigned long)wl->aggr_buf, get_order(wl->aggr_buf_size));
+ kfree(wl->aggr_buf);
wl1271_debugfs_exit(wl);
--
2.53.0
^ permalink raw reply related
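The aggr_buf hunk in the patch above swaps a get_order()-based request for a round_up()-based one. The sketch below is a userspace Python model of that arithmetic only, not kernel code. It assumes a PAGE_SIZE of 4096, and get_order() and round_up() are reimplemented here from their usual meaning rather than taken from the kernel. What the allocator actually hands back for a given request is outside the scope of this sketch.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the real value is arch-dependent


def get_order(size: int) -> int:
    """Smallest order such that (PAGE_SIZE << order) >= size (size > 0)."""
    order = 0
    while (PAGE_SIZE << order) < size:
        order += 1
    return order


def round_up(x: int, align: int) -> int:
    """Round x up to a multiple of align; align must be a power of two."""
    return (x + align - 1) & ~(align - 1)


def old_request(size: int) -> int:
    """Bytes covered by __get_free_pages(GFP_KERNEL, get_order(size))."""
    return PAGE_SIZE << get_order(size)


def new_request(size: int) -> int:
    """Bytes asked for by kmalloc(round_up(size, PAGE_SIZE), GFP_KERNEL)."""
    return round_up(size, PAGE_SIZE)


if __name__ == "__main__":
    for size in (1, PAGE_SIZE, PAGE_SIZE + 1, 3 * PAGE_SIZE):
        print(size, old_request(size), new_request(size))
```

The two requests coincide whenever the page-rounded size is a power-of-two number of pages, and differ otherwise (three pages versus four for a three-page buffer in this model).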
* [PATCH net-next 0/3] net: report multicast group user count
From: Yuyang Huang @ 2026-06-30 11:02 UTC (permalink / raw)
To: Yuyang Huang
Cc: David S. Miller, Andrew Lunn, David Ahern, Donald Hunter,
Eric Dumazet, Ido Schimmel, Jakub Kicinski, Paolo Abeni,
Shuah Khan, Simon Horman, linux-kernel, linux-kselftest, netdev
RTM_GETMULTICAST reports IPv4 and IPv6 multicast group membership, but
does not include the per-group user count. Userspace therefore still has
to parse /proc/net/igmp and /proc/net/igmp6 to obtain the Users column.
In particular, this prevents iproute2 from moving "ip maddr show"
entirely from procfs to rtnetlink.
Add IFA_MC_USERS to carry the user count in RTM_GETMULTICAST dumps and
RTM_NEWMULTICAST / RTM_DELMULTICAST notifications for both address
families. Update the rt-addr YNL specification and extend the rtnetlink
selftest to verify that two joins increase the reported count by two.
Yuyang Huang (3):
net: ipv4: report multicast group user count
net: ipv6: report multicast group user count
selftests: net: check multicast group user count
Documentation/netlink/specs/rt-addr.yaml | 4 +
include/uapi/linux/if_addr.h | 1 +
net/ipv4/igmp.c | 2 +
net/ipv6/addrconf.c | 1 +
net/ipv6/mcast.c | 1 +
tools/testing/selftests/net/rtnetlink.py | 101 ++++++++++++++++++++---
6 files changed, 99 insertions(+), 11 deletions(-)
--
2.43.0
^ permalink raw reply
* [PATCH net-next 1/3] net: ipv4: report multicast group user count
From: Yuyang Huang @ 2026-06-30 11:02 UTC (permalink / raw)
To: Yuyang Huang
Cc: David S. Miller, Andrew Lunn, David Ahern, Donald Hunter,
Eric Dumazet, Ido Schimmel, Jakub Kicinski, Paolo Abeni,
Shuah Khan, Simon Horman, linux-kernel, linux-kselftest, netdev
In-Reply-To: <20260630110207.37841-1-sigefriedhyy@gmail.com>
RTM_GETMULTICAST has been part of the rtnetlink ABI for a long time
and already reports IPv4 multicast group membership through
IFA_MULTICAST and IFA_CACHEINFO. It does not report how many consumers
hold each membership, so userspace still has to parse /proc/net/igmp to
get the Users column.
Add IFA_MC_USERS as a u32 attribute carrying ip_mc_list::users in
RTM_GETMULTICAST replies and entry-lifecycle notifications.
This gives iproute2 enough information to migrate the IPv4 part of
"ip maddr show" from procfs parsing to rtnetlink.
Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
---
Documentation/netlink/specs/rt-addr.yaml | 4 ++++
include/uapi/linux/if_addr.h | 1 +
net/ipv4/igmp.c | 2 ++
3 files changed, 7 insertions(+)
diff --git a/Documentation/netlink/specs/rt-addr.yaml b/Documentation/netlink/specs/rt-addr.yaml
index 163a106c41bb..0ecbd24c890c 100644
--- a/Documentation/netlink/specs/rt-addr.yaml
+++ b/Documentation/netlink/specs/rt-addr.yaml
@@ -123,6 +123,9 @@ attribute-sets:
-
name: proto
type: u8
+ -
+ name: mc-users
+ type: u32
operations:
@@ -176,6 +179,7 @@ operations:
value: 58
attributes: &mcaddr-attrs
- multicast
+ - mc-users
- cacheinfo
dump:
request:
diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index aa7958b4e41d..7fb630b7fe31 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -36,6 +36,7 @@ enum {
IFA_RT_PRIORITY, /* u32, priority/metric for prefix route */
IFA_TARGET_NETNSID,
IFA_PROTO, /* u8, address protocol */
+ IFA_MC_USERS, /* u32, multicast group users */
__IFA_MAX,
};
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index b6337a47c141..116ce7cec80e 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1473,6 +1473,7 @@ int inet_fill_ifmcaddr(struct sk_buff *skb, struct net_device *dev,
ci.ifa_valid = INFINITY_LIFE_TIME;
if (nla_put_in_addr(skb, IFA_MULTICAST, im->multiaddr) < 0 ||
+ nla_put_u32(skb, IFA_MC_USERS, READ_ONCE(im->users)) < 0 ||
nla_put(skb, IFA_CACHEINFO, sizeof(ci), &ci) < 0) {
nlmsg_cancel(skb, nlh);
return -EMSGSIZE;
@@ -1494,6 +1495,7 @@ static void inet_ifmcaddr_notify(struct net_device *dev,
skb = nlmsg_new(NLMSG_ALIGN(sizeof(struct ifaddrmsg)) +
nla_total_size(sizeof(__be32)) +
+ nla_total_size(sizeof(u32)) +
nla_total_size(sizeof(struct ifa_cacheinfo)),
GFP_KERNEL);
if (!skb)
--
2.43.0
^ permalink raw reply related
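The notification hunk in the patch above grows the nlmsg_new() payload estimate by nla_total_size(sizeof(u32)). The sketch below models that size arithmetic in userspace Python, as an illustration rather than kernel code. The constants are assumptions taken from the usual netlink layout: a 4-byte attribute header, 4-byte alignment, an 8-byte struct ifaddrmsg and a 16-byte struct ifa_cacheinfo. The helper names mirror the kernel macros but are reimplemented here.

```python
NLA_ALIGNTO = 4     # assumption: netlink attributes are 4-byte aligned
NLA_HDRLEN = 4      # assumption: struct nlattr is u16 nla_len + u16 nla_type
NLMSG_ALIGNTO = 4   # assumption: netlink messages are 4-byte aligned

SIZEOF_IFADDRMSG = 8        # four u8 fields plus a u32 ifa_index
SIZEOF_BE32 = 4             # IPv4 multicast address
SIZEOF_U32 = 4              # IFA_MC_USERS value
SIZEOF_IFA_CACHEINFO = 16   # four u32 fields


def align(n: int, to: int) -> int:
    """Round n up to a multiple of to (to must be a power of two)."""
    return (n + to - 1) & ~(to - 1)


def nla_total_size(payload: int) -> int:
    """Attribute header plus payload, padded to the attribute alignment."""
    return align(NLA_HDRLEN + payload, NLA_ALIGNTO)


def ipv4_mcaddr_payload(with_users: bool) -> int:
    """Payload size passed to nlmsg_new() in this model of the notify path."""
    size = align(SIZEOF_IFADDRMSG, NLMSG_ALIGNTO)
    size += nla_total_size(SIZEOF_BE32)
    if with_users:
        size += nla_total_size(SIZEOF_U32)
    size += nla_total_size(SIZEOF_IFA_CACHEINFO)
    return size


if __name__ == "__main__":
    print(ipv4_mcaddr_payload(False), ipv4_mcaddr_payload(True))
```

Under these assumptions the new attribute costs 8 bytes per message: 4 bytes of header and 4 bytes of value.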
* [PATCH net-next 2/3] net: ipv6: report multicast group user count
From: Yuyang Huang @ 2026-06-30 11:02 UTC (permalink / raw)
To: Yuyang Huang
Cc: David S. Miller, Andrew Lunn, David Ahern, Donald Hunter,
Eric Dumazet, Ido Schimmel, Jakub Kicinski, Paolo Abeni,
Shuah Khan, Simon Horman, linux-kernel, linux-kselftest, netdev
In-Reply-To: <20260630110207.37841-1-sigefriedhyy@gmail.com>
The previous patch added IFA_MC_USERS and emits it for IPv4 multicast
groups. Add the same snapshot attribute to IPv6 RTM_GETMULTICAST
replies and entry-lifecycle notifications, carrying
ifmcaddr6::mca_users.
This makes the multicast rtnetlink ABI symmetric across IPv4 and IPv6
and gives userspace the same user count that /proc/net/igmp6 exposes.
Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
---
net/ipv6/addrconf.c | 1 +
net/ipv6/mcast.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index cbe681de3818..f1fe9ede1edb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5264,6 +5264,7 @@ int inet6_fill_ifmcaddr(struct sk_buff *skb,
put_ifaddrmsg(nlh, 128, IFA_F_PERMANENT, scope, ifindex);
if (nla_put_in6_addr(skb, IFA_MULTICAST, &ifmca->mca_addr) < 0 ||
+ nla_put_u32(skb, IFA_MC_USERS, READ_ONCE(ifmca->mca_users)) < 0 ||
put_cacheinfo(skb, ifmca->mca_cstamp, READ_ONCE(ifmca->mca_tstamp),
INFINITY_LIFE_TIME, INFINITY_LIFE_TIME) < 0) {
nlmsg_cancel(skb, nlh);
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 04b811b3be97..774f4c72a6fa 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -908,6 +908,7 @@ static void inet6_ifmcaddr_notify(struct net_device *dev,
skb = nlmsg_new(NLMSG_ALIGN(sizeof(struct ifaddrmsg)) +
nla_total_size(sizeof(struct in6_addr)) +
+ nla_total_size(sizeof(u32)) +
nla_total_size(sizeof(struct ifa_cacheinfo)),
GFP_KERNEL);
if (!skb)
--
2.43.0
^ permalink raw reply related
* [PATCH net-next 3/3] selftests: net: check multicast group user count
From: Yuyang Huang @ 2026-06-30 11:02 UTC (permalink / raw)
To: Yuyang Huang
Cc: David S. Miller, Andrew Lunn, David Ahern, Donald Hunter,
Eric Dumazet, Ido Schimmel, Jakub Kicinski, Paolo Abeni,
Shuah Khan, Simon Horman, linux-kernel, linux-kselftest, netdev
In-Reply-To: <20260630110207.37841-1-sigefriedhyy@gmail.com>
Extend the RTM_GETMULTICAST dump test to verify IFA_MC_USERS for both
IPv4 and IPv6 multicast groups.
Run each protocol test in a fresh network namespace to avoid changing
host-network state or racing with unrelated multicast users. Join a
fixed multicast group twice using separate sockets and check that the
reported user count increases by two.
Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
---
tools/testing/selftests/net/rtnetlink.py | 101 ++++++++++++++++++++---
1 file changed, 90 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/net/rtnetlink.py b/tools/testing/selftests/net/rtnetlink.py
index 3622413d793d..0c67c7c00d84 100755
--- a/tools/testing/selftests/net/rtnetlink.py
+++ b/tools/testing/selftests/net/rtnetlink.py
@@ -2,27 +2,106 @@
# SPDX-License-Identifier: GPL-2.0
import socket
+import struct
import time
-from lib.py import bkg, ip, ksft_exit, ksft_run, ksft_ge, ksft_true, KsftSkipEx
+from lib.py import bkg, ip, ksft_exit, ksft_run, ksft_eq, ksft_ge, ksft_true, KsftSkipEx
from lib.py import CmdExitFailure, NetNS, NetNSEnter, RtnlAddrFamily
IPV4_ALL_HOSTS_MULTICAST = b'\xe0\x00\x00\x01'
+IPV4_TEST_MULTICAST = b'\xef\x01\x01\x01'
+IPV6_TEST_MULTICAST = bytes.fromhex('ff020000000000000000000000000123')
+
+
+def _users_for(rtnl: RtnlAddrFamily, family: int, grp: bytes, ifindex: int):
+ """Return mc-users for grp on ifindex, or 0 if absent."""
+
+ addrs = rtnl.getmulticast({"ifa-family": family}, dump=True)
+ matches = [addr for addr in addrs
+ if addr['multicast'] == grp and addr['ifa-index'] == ifindex]
+ if not matches:
+ return 0
+ if 'mc-users' not in matches[0]:
+ return None
+
+ return matches[0]['mc-users']
+
def dump_mcaddr_check() -> None:
"""
- Verify that at least one interface has the IPv4 all-hosts multicast address.
- At least the loopback interface should have this address.
+ Verify IPv4 multicast addresses and their user counts in RTM_GETMULTICAST.
+ """
+
+ with NetNS() as ns:
+ with NetNSEnter(str(ns)):
+ ip("link set lo up")
+ rtnl = RtnlAddrFamily()
+ lo_idx = socket.if_nametoindex('lo')
+ addresses = rtnl.getmulticast({"ifa-family": socket.AF_INET}, dump=True)
+
+ all_host_multicasts = [
+ addr for addr in addresses
+ if addr['multicast'] == IPV4_ALL_HOSTS_MULTICAST
+ ]
+
+ ksft_ge(len(all_host_multicasts), 1,
+ "No interface found with the IPv4 all-hosts multicast address")
+
+ mreq = IPV4_TEST_MULTICAST + socket.inet_aton('127.0.0.1')
+ before = _users_for(rtnl, socket.AF_INET, IPV4_TEST_MULTICAST, lo_idx)
+ if before is None:
+ raise KsftSkipEx("kernel does not expose IFA_MC_USERS")
+
+ s1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+ s2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+ try:
+ s1.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
+ s2.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
+
+ after_join = _users_for(rtnl, socket.AF_INET,
+ IPV4_TEST_MULTICAST, lo_idx)
+ if after_join is None:
+ raise KsftSkipEx("kernel does not expose IFA_MC_USERS")
+ ksft_eq(after_join - before, 2,
+ f"users delta != 2 after two joins "
+ f"(before={before}, after={after_join})")
+ finally:
+ s1.close()
+ s2.close()
+
+
+def dump_mcaddr6_check() -> None:
+ """
+ Verify IPv6 multicast addresses and their user counts in RTM_GETMULTICAST.
"""
- rtnl = RtnlAddrFamily()
- addresses = rtnl.getmulticast({"ifa-family": socket.AF_INET}, dump=True)
+ with NetNS() as ns:
+ with NetNSEnter(str(ns)):
+ ip("link set lo up")
+ rtnl = RtnlAddrFamily()
+ lo_idx = socket.if_nametoindex('lo')
+ before = _users_for(rtnl, socket.AF_INET6,
+ IPV6_TEST_MULTICAST, lo_idx)
+ if before is None:
+ raise KsftSkipEx("kernel does not expose IFA_MC_USERS for IPv6")
+
+ mreq = IPV6_TEST_MULTICAST + struct.pack('=I', lo_idx)
+ s1 = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
+ s2 = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
+ try:
+ s1.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
+ s2.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)
- all_host_multicasts = [
- addr for addr in addresses if addr['multicast'] == IPV4_ALL_HOSTS_MULTICAST
- ]
+ after_join = _users_for(rtnl, socket.AF_INET6,
+ IPV6_TEST_MULTICAST, lo_idx)
+ if after_join is None:
+ raise KsftSkipEx("kernel does not expose IFA_MC_USERS for IPv6")
+ ksft_eq(after_join - before, 2,
+ f"IPv6 users delta != 2 after two joins "
+ f"(before={before}, after={after_join})")
+ finally:
+ s1.close()
+ s2.close()
- ksft_ge(len(all_host_multicasts), 1,
- "No interface found with the IPv4 all-hosts multicast address")
def ipv4_devconf_notify() -> None:
"""
@@ -56,7 +135,7 @@ def ipv4_devconf_notify() -> None:
f"No 'forwarding on' notificiation found for interface {ifname}")
def main() -> None:
- ksft_run([dump_mcaddr_check, ipv4_devconf_notify])
+ ksft_run([dump_mcaddr_check, dump_mcaddr6_check, ipv4_devconf_notify])
ksft_exit()
if __name__ == "__main__":
--
2.43.0
^ permalink raw reply related
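The selftest above builds its membership requests by concatenating raw bytes. The sketch below shows the same construction in isolation, without opening any sockets. The two group constants are copied from the patch; the helper names ipv4_mreq(), ipv6_mreq() and group_text() are hypothetical and exist only for this illustration.

```python
import ipaddress
import socket
import struct

IPV4_TEST_MULTICAST = b'\xef\x01\x01\x01'
IPV6_TEST_MULTICAST = bytes.fromhex('ff020000000000000000000000000123')


def ipv4_mreq(group: bytes, local_addr: str) -> bytes:
    """4-byte group address followed by the 4-byte local interface address."""
    return group + socket.inet_aton(local_addr)


def ipv6_mreq(group: bytes, ifindex: int) -> bytes:
    """16-byte group address followed by the interface index as a native u32."""
    return group + struct.pack('=I', ifindex)


def group_text(group: bytes) -> str:
    """Human-readable form of a packed IPv4 or IPv6 address."""
    return str(ipaddress.ip_address(group))


if __name__ == "__main__":
    print(group_text(IPV4_TEST_MULTICAST), len(ipv4_mreq(IPV4_TEST_MULTICAST, '127.0.0.1')))
    print(group_text(IPV6_TEST_MULTICAST), len(ipv6_mreq(IPV6_TEST_MULTICAST, 1)))
```

The resulting byte strings are what the test passes to setsockopt() with IP_ADD_MEMBERSHIP and IPV6_JOIN_GROUP respectively.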
* [PATCH v2 3/7] ipack: tpci200: don't keep pci_device_id
From: Gary Guo @ 2026-06-30 11:09 UTC (permalink / raw)
To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Damien Le Moal,
Niklas Cassel, GOTO Masanori, YOKOTA Hiroshi,
James E.J. Bottomley, Martin K. Petersen, Vaibhav Gupta,
Jens Taprogge, Ido Schimmel, Petr Machata, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-pci, driver-core, linux-kernel, linux-ide, linux-scsi,
industrypack-devel, netdev, Gary Guo
In-Reply-To: <20260630-pci_id_fix-v2-0-b834a98c0af2@garyguo.net>
A pci_device_id is not guaranteed to outlive probe because of dynamic IDs.
The stored ID is unused, so remove it.
Signed-off-by: Gary Guo <gary@garyguo.net>
---
drivers/ipack/carriers/tpci200.c | 1 -
drivers/ipack/carriers/tpci200.h | 1 -
2 files changed, 2 deletions(-)
diff --git a/drivers/ipack/carriers/tpci200.c b/drivers/ipack/carriers/tpci200.c
index 05dcb6675cd6..1cf51f763293 100644
--- a/drivers/ipack/carriers/tpci200.c
+++ b/drivers/ipack/carriers/tpci200.c
@@ -562,7 +562,6 @@ static int tpci200_pci_probe(struct pci_dev *pdev,
/* Save struct pci_dev pointer */
tpci200->info->pdev = pdev;
- tpci200->info->id_table = (struct pci_device_id *)id;
/* register the device and initialize it */
ret = tpci200_install(tpci200);
diff --git a/drivers/ipack/carriers/tpci200.h b/drivers/ipack/carriers/tpci200.h
index e79ac64abcff..a2bf3125794b 100644
--- a/drivers/ipack/carriers/tpci200.h
+++ b/drivers/ipack/carriers/tpci200.h
@@ -145,7 +145,6 @@ struct tpci200_slot {
*/
struct tpci200_infos {
struct pci_dev *pdev;
- struct pci_device_id *id_table;
struct tpci200_regs __iomem *interface_regs;
void __iomem *cfg_regs;
struct ipack_bus_device *ipack_bus;
--
2.54.0
^ permalink raw reply related
* [PATCH v2 2/7] nsp32: don't keep pci_device_id
From: Gary Guo @ 2026-06-30 11:09 UTC (permalink / raw)
To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Damien Le Moal,
Niklas Cassel, GOTO Masanori, YOKOTA Hiroshi,
James E.J. Bottomley, Martin K. Petersen, Vaibhav Gupta,
Jens Taprogge, Ido Schimmel, Petr Machata, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-pci, driver-core, linux-kernel, linux-ide, linux-scsi,
industrypack-devel, netdev, Gary Guo
In-Reply-To: <20260630-pci_id_fix-v2-0-b834a98c0af2@garyguo.net>
A pci_device_id is not guaranteed to outlive probe because of dynamic IDs.
All information apart from driver_data can be easily retrieved from
pci_dev, so just store driver_data.
Signed-off-by: Gary Guo <gary@garyguo.net>
---
drivers/scsi/nsp32.c | 8 ++++----
drivers/scsi/nsp32.h | 8 ++++----
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/nsp32.c b/drivers/scsi/nsp32.c
index e893d5677241..9c9281222a0a 100644
--- a/drivers/scsi/nsp32.c
+++ b/drivers/scsi/nsp32.c
@@ -1470,7 +1470,7 @@ static int nsp32_show_info(struct seq_file *m, struct Scsi_Host *host)
(nsp32_read2(base, INDEX_REG) >> 8) & 0xff);
mode_reg = nsp32_index_read1(base, CHIP_MODE);
- model = data->pci_devid->driver_data;
+ model = data->model;
#ifdef CONFIG_PM
seq_printf(m, "Power Management: %s\n",
@@ -2907,8 +2907,8 @@ static int nsp32_eh_host_reset(struct scsi_cmnd *SCpnt)
*/
static int nsp32_getprom_param(nsp32_hw_data *data)
{
- int vendor = data->pci_devid->vendor;
- int device = data->pci_devid->device;
+ int vendor = data->Pci->vendor;
+ int device = data->Pci->device;
int ret, i;
int __maybe_unused val;
@@ -3340,7 +3340,7 @@ static int nsp32_probe(struct pci_dev *pdev, const struct pci_device_id *id)
}
data->Pci = pdev;
- data->pci_devid = id;
+ data->model = id->driver_data;
data->IrqNumber = pdev->irq;
data->BaseAddress = pci_resource_start(pdev, 0);
data->NumAddress = pci_resource_len (pdev, 0);
diff --git a/drivers/scsi/nsp32.h b/drivers/scsi/nsp32.h
index 924889f8bd37..9e65771cb592 100644
--- a/drivers/scsi/nsp32.h
+++ b/drivers/scsi/nsp32.h
@@ -564,10 +564,10 @@ typedef struct _nsp32_hw_data {
struct scsi_cmnd *CurrentSC;
- struct pci_dev *Pci;
- const struct pci_device_id *pci_devid;
- struct Scsi_Host *Host;
- spinlock_t Lock;
+ struct pci_dev *Pci;
+ int model;
+ struct Scsi_Host *Host;
+ spinlock_t Lock;
char info_str[100];
--
2.54.0
^ permalink raw reply related
* [PATCH v2 6/7] pci: fix dyn_id add TOCTOU
From: Gary Guo @ 2026-06-30 11:09 UTC (permalink / raw)
To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Damien Le Moal,
Niklas Cassel, GOTO Masanori, YOKOTA Hiroshi,
James E.J. Bottomley, Martin K. Petersen, Vaibhav Gupta,
Jens Taprogge, Ido Schimmel, Petr Machata, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-pci, driver-core, linux-kernel, linux-ide, linux-scsi,
industrypack-devel, netdev, Gary Guo
In-Reply-To: <20260630-pci_id_fix-v2-0-b834a98c0af2@garyguo.net>
new_id_store() currently has a TOCTOU issue: the duplicate check done by
pci_match_device() and the dynamic ID insertion in pci_add_dynid() run in
separate critical sections.
Fix this by moving the check against existing dynamic IDs inside
pci_add_dynid(), under the same lock as the insertion, and checking only
the static ID table outside the critical section.
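
The difference between checking and inserting in separate critical sections versus a single one can be modelled in userspace. The Python sketch below is only an analogy for the locking pattern and is not derived from the kernel sources: the DynIds class, its method names and the tuple IDs are hypothetical, and threading.Lock stands in for the dynids spinlock.

```python
import threading


class DynIds:
    """Toy model of a driver's dynamic ID list guarded by one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._ids = []

    def contains(self, new_id) -> bool:
        with self._lock:
            return new_id in self._ids

    def add_racy(self, new_id) -> bool:
        # Check and insert take the lock separately, so another thread
        # can insert the same ID between the two steps.
        if self.contains(new_id):
            return False
        with self._lock:
            self._ids.append(new_id)
        return True

    def add_checked(self, new_id) -> bool:
        # Check and insert share one critical section, so at most one
        # caller can ever add a given ID.
        with self._lock:
            if new_id in self._ids:
                return False
            self._ids.append(new_id)
            return True

    def count(self, new_id) -> int:
        with self._lock:
            return self._ids.count(new_id)


def add_from_threads(ids: DynIds, new_id, nthreads: int = 8) -> list:
    """Call add_checked() for the same ID from several threads."""
    results = []
    results_lock = threading.Lock()

    def worker() -> None:
        ok = ids.add_checked(new_id)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With add_checked(), exactly one of the concurrent callers succeeds and the others see the duplicate, which corresponds to the -EEXIST return in the patch. add_racy() gives no such guarantee.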
Fixes: 3853f9123c18 ("PCI: Avoid duplicate IDs in driver dynamic IDs list")
Signed-off-by: Gary Guo <gary@garyguo.net>
---
drivers/pci/pci-driver.c | 139 ++++++++++++++++++++++++-----------------------
1 file changed, 71 insertions(+), 68 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0507cb801310..df1be7ea2bde 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -29,6 +29,48 @@ struct pci_dynid {
struct pci_device_id id;
};
+/**
+ * do_pci_add_dynid - add a new PCI device ID to this driver and re-probe devices
+ * @drv: target pci driver
+ * @id: ID to be added
+ * @check_dup: whether to check if matching ID is already present
+ *
+ * Adds a new dynamic pci device ID to this driver and causes the
+ * driver to probe for all devices again. @drv must have been
+ * registered prior to calling this function.
+ *
+ * CONTEXT:
+ * Does GFP_KERNEL allocation.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+static int do_pci_add_dynid(struct pci_driver *drv, const struct pci_device_id *id, bool check_dup)
+{
+ struct pci_dynid *dynid, *existing_dynid;
+
+ dynid = kzalloc_obj(*dynid);
+ if (!dynid)
+ return -ENOMEM;
+
+ dynid->id = *id;
+
+ {
+ guard(spinlock)(&drv->dynids.lock);
+ if (check_dup) {
+ list_for_each_entry(existing_dynid, &drv->dynids.list, node) {
+ if (pci_match_one_id(&existing_dynid->id, id)) {
+ kfree(dynid);
+ return -EEXIST;
+ }
+ }
+ }
+ list_add_tail(&dynid->node, &drv->dynids.list);
+ }
+
+ return driver_attach(&drv->driver);
+}
+
/**
* pci_add_dynid - add a new PCI device ID to this driver and re-probe devices
* @drv: target pci driver
@@ -56,25 +98,17 @@ int pci_add_dynid(struct pci_driver *drv,
unsigned int class, unsigned int class_mask,
unsigned long driver_data)
{
- struct pci_dynid *dynid;
+ struct pci_device_id id = {
+ .vendor = vendor,
+ .device = device,
+ .subvendor = subvendor,
+ .subdevice = subdevice,
+ .class = class,
+ .class_mask = class_mask,
+ .driver_data = driver_data,
+ };
- dynid = kzalloc_obj(*dynid);
- if (!dynid)
- return -ENOMEM;
-
- dynid->id.vendor = vendor;
- dynid->id.device = device;
- dynid->id.subvendor = subvendor;
- dynid->id.subdevice = subdevice;
- dynid->id.class = class;
- dynid->id.class_mask = class_mask;
- dynid->id.driver_data = driver_data;
-
- spin_lock(&drv->dynids.lock);
- list_add_tail(&dynid->node, &drv->dynids.list);
- spin_unlock(&drv->dynids.lock);
-
- return driver_attach(&drv->driver);
+ return do_pci_add_dynid(drv, &id, false);
}
EXPORT_SYMBOL_GPL(pci_add_dynid);
@@ -99,11 +133,13 @@ static void pci_free_dynids(struct pci_driver *drv)
* %NULL if there is no match.
*/
static const struct pci_device_id *do_pci_match_id(const struct pci_device_id *ids,
- const struct pci_device_id *dev_id)
+ const struct pci_device_id *dev_id,
+ bool match_override_only)
{
if (ids) {
while (ids->vendor || ids->subvendor || ids->class_mask) {
- if (pci_match_one_id(ids, dev_id))
+ if ((!ids->override_only || match_override_only) &&
+ pci_match_one_id(ids, dev_id))
return ids;
ids++;
}
@@ -128,7 +164,7 @@ const struct pci_device_id *pci_match_id(const struct pci_device_id *ids,
{
struct pci_device_id dev_id = pci_id_from_device(dev);
- return do_pci_match_id(ids, &dev_id);
+ return do_pci_match_id(ids, &dev_id, true);
}
EXPORT_SYMBOL(pci_match_id);
@@ -153,7 +189,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
struct pci_dev *dev)
{
struct pci_dynid *dynid;
- const struct pci_device_id *found_id = NULL, *ids;
+ const struct pci_device_id *found_id = NULL;
struct pci_device_id dev_id;
int ret;
@@ -176,20 +212,9 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
if (found_id)
return found_id;
- for (ids = drv->id_table; (found_id = do_pci_match_id(ids, &dev_id));
- ids = found_id + 1) {
- /*
- * The match table is split based on driver_override.
- * In case override_only was set, enforce driver_override
- * matching.
- */
- if (found_id->override_only) {
- if (ret > 0)
- return found_id;
- } else {
- return found_id;
- }
- }
+ found_id = do_pci_match_id(drv->id_table, &dev_id, ret > 0);
+ if (found_id)
+ return found_id;
/* driver_override will always match, send a dummy id */
if (ret > 0)
@@ -197,11 +222,6 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
return NULL;
}
-static void _pci_free_device(struct device *dev)
-{
- kfree(to_pci_dev(dev));
-}
-
/**
* new_id_store - sysfs frontend to pci_add_dynid()
* @driver: target device driver
@@ -215,38 +235,22 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
{
struct pci_driver *pdrv = to_pci_driver(driver);
const struct pci_device_id *ids = pdrv->id_table;
- u32 vendor, device, subvendor = PCI_ANY_ID,
- subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
- unsigned long driver_data = 0;
+ struct pci_device_id id = {
+ .subvendor = PCI_ANY_ID,
+ .subdevice = PCI_ANY_ID
+ };
int fields;
int retval = 0;
fields = sscanf(buf, "%x %x %x %x %x %x %lx",
- &vendor, &device, &subvendor, &subdevice,
- &class, &class_mask, &driver_data);
+ &id.vendor, &id.device, &id.subvendor, &id.subdevice,
+ &id.class, &id.class_mask, &id.driver_data);
if (fields < 2)
return -EINVAL;
if (fields != 7) {
- struct pci_dev *pdev = kzalloc_obj(*pdev);
- if (!pdev)
- return -ENOMEM;
-
- pdev->vendor = vendor;
- pdev->device = device;
- pdev->subsystem_vendor = subvendor;
- pdev->subsystem_device = subdevice;
- pdev->class = class;
- pdev->dev.release = _pci_free_device;
-
- device_initialize(&pdev->dev);
- if (pci_match_device(pdrv, pdev))
- retval = -EEXIST;
-
- put_device(&pdev->dev);
-
- if (retval)
- return retval;
+ if (do_pci_match_id(pdrv->id_table, &id, false))
+ return -EEXIST;
}
/* Only accept driver_data values that match an existing id_table
@@ -254,7 +258,7 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
if (ids) {
retval = -EINVAL;
while (ids->vendor || ids->subvendor || ids->class_mask) {
- if (driver_data == ids->driver_data) {
+ if (id.driver_data == ids->driver_data) {
retval = 0;
break;
}
@@ -264,8 +268,7 @@ static ssize_t new_id_store(struct device_driver *driver, const char *buf,
return retval;
}
- retval = pci_add_dynid(pdrv, vendor, device, subvendor, subdevice,
- class, class_mask, driver_data);
+ retval = do_pci_add_dynid(pdrv, &id, fields != 7);
if (retval)
return retval;
return count;
--
2.54.0
* Re: [PATCH bpf-next v2] selftests/bpf: Mask socket type flags in mptcpify prog
From: Matthieu Baerts @ 2026-06-30 11:09 UTC (permalink / raw)
To: Guillaume Maudoux
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Matthieu Baerts,
Mat Martineau, Geliang Tang, Shuah Khan, bpf, mptcp, netdev,
linux-kselftest, linux-kernel
In-Reply-To: <20260630095723.564392-1-layus.on@gmail.com>
Hi Guillaume,
Thank you for sharing this fix!
> The mptcpify BPF prog upgrades eligible TCP sockets to MPTCP, but only
> when the socket type is exactly SOCK_STREAM. Its update_socket_protocol()
> hook runs on the raw type from userspace, before the socket core masks
> it with SOCK_TYPE_MASK, so the type may still carry SOCK_CLOEXEC or
> SOCK_NONBLOCK in its upper bits and the equality check fails.
>
> As a result, a socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0) -- what
> common libraries do by default -- is silently left as plain TCP. This
> was hit in practice with curl. Since mptcpify.c is referenced as example
> code for enabling MPTCP transparently, the same mistake is likely to be
> copied into real deployments where it fails the same way and is hard to
> diagnose.
Good idea to fix this code in mptcpify.c, which is indeed used as a
reference, and ends up in real deployments.
The modifications look good to me:
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
* [PATCH v2 0/7] pci: fix UAF and TOCTOU related to dynamic ID
From: Gary Guo @ 2026-06-30 11:09 UTC (permalink / raw)
To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Damien Le Moal,
Niklas Cassel, GOTO Masanori, YOKOTA Hiroshi,
James E.J. Bottomley, Martin K. Petersen, Vaibhav Gupta,
Jens Taprogge, Ido Schimmel, Petr Machata, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-pci, driver-core, linux-kernel, linux-ide, linux-scsi,
industrypack-devel, netdev, Gary Guo, Sashiko
While working on improving the Rust abstractions [1], Sashiko reported an
existing UAF issue related to dynamic IDs, which I find to be genuine.
While looking at the code I also found a TOCTOU issue: the existence check
for a dynamic ID happens in a separate critical section from the actual
insertion. This series fixes both issues.
There are two exported functions, "pci_match_id" and "pci_add_dynid", which
I have to tweak to implement this cleanly; I created separate "do_xxx"
functions to keep the existing APIs because both have multiple users.
A few existing users store the pci_device_id argument of their probe
callback. This is a bad pattern because driver_data is the only thing they
actually want from pci_device_id; the actual ID information can be
retrieved from pci_dev instead. I've used the following coccinelle script
to find the cases where the argument is stored, and converted them to stop
storing pci_device_id.
@store@
identifier fn;
identifier id;
expression E;
parameter list[n] ps;
@@
fn(ps, struct pci_device_id *id, ...)
{
...
* E = id
...
}
@cast@
identifier fn;
identifier id;
parameter list[n] ps;
@@
fn(ps, struct pci_device_id *id, ...)
{
...
* (void *)id
...
}
@in_struct@
identifier s, fld;
@@
struct s {
...
* struct pci_device_id *fld;
...
};
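For illustration only, the conversion applied to the flagged drivers can be sketched in plain userspace C as below. The sketch_* names are hypothetical; the shape follows the nsp32 change in this series, where the stored pointer is replaced by a copy of driver_data.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct pci_device_id. */
struct sketch_device_id {
	uint32_t vendor, device;
	unsigned long driver_data;
};

/* Driver-private data: keeps a copy of driver_data, not a pointer to the ID. */
struct sketch_hw_data {
	int model;
};

/*
 * Probe-style callback. @id is only used during the call; nothing in
 * @data refers to it afterwards, so the ID may be modified or freed later.
 */
void sketch_probe(struct sketch_hw_data *data,
		  const struct sketch_device_id *id)
{
	data->model = (int)id->driver_data;
}
```

Because only the value is copied, the driver data stays valid even if the ID entry (for example a dynamic ID) goes away after probe returns.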
Link: https://lore.kernel.org/all/20260618-id_info-v1-0-96af1e559ef9@garyguo.net/ [1]
Link: https://lore.kernel.org/all/20260619170503.518F61F00A3A@smtp.kernel.org/ [2]
---
Changes in v2:
- Fix users which store pci_device_id.
- Clarify in probe documentation about the lifetime of pci_device_id
parameter.
- Dynamic ID conflict check now ignores override_only. (Sashiko)
- Link to v1: https://patch.msgid.link/20260626-pci_id_fix-v1-0-a35c803f1b95@garyguo.net
---
Gary Guo (7):
ata: don't keep pci_device_id
nsp32: don't keep pci_device_id
ipack: tpci200: don't keep pci_device_id
mlxsw: don't keep pci_device_id
pci: make pci_match_one_device match on ID instead of device
pci: fix dyn_id add TOCTOU
pci: fix UAF when probe runs concurrent to dyn ID removal
drivers/ata/ata_generic.c | 6 +-
drivers/ipack/carriers/tpci200.c | 1 -
drivers/ipack/carriers/tpci200.h | 1 -
drivers/net/ethernet/mellanox/mlxsw/pci.c | 11 +-
drivers/pci/pci-driver.c | 219 ++++++++++++++++--------------
drivers/pci/pci.h | 36 +++--
drivers/pci/search.c | 6 +-
drivers/scsi/nsp32.c | 8 +-
drivers/scsi/nsp32.h | 8 +-
include/linux/pci.h | 1 +
10 files changed, 166 insertions(+), 131 deletions(-)
---
base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
change-id: 20260626-pci_id_fix-83eaec007674
Best regards,
--
Gary Guo <gary@garyguo.net>
* [PATCH v2 5/7] pci: make pci_match_one_device match on ID instead of device
From: Gary Guo @ 2026-06-30 11:09 UTC (permalink / raw)
To: Bjorn Helgaas, Zhenzhong Duan, Greg Kroah-Hartman,
Rafael J. Wysocki, Danilo Krummrich, Damien Le Moal,
Niklas Cassel, GOTO Masanori, YOKOTA Hiroshi,
James E.J. Bottomley, Martin K. Petersen, Vaibhav Gupta,
Jens Taprogge, Ido Schimmel, Petr Machata, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-pci, driver-core, linux-kernel, linux-ide, linux-scsi,
industrypack-devel, netdev, Gary Guo
In-Reply-To: <20260630-pci_id_fix-v2-0-b834a98c0af2@garyguo.net>
There is a need to match against bare IDs rather than devices. Rename
pci_match_one_device to pci_match_one_id, and add a pci_id_from_device
helper to make it easy to convert users.
Similarly, add an ID-based do_pci_match_id; the existing pci_match_id API
is kept as a wrapper because it has quite a few users.
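A minimal userspace sketch of ID-against-ID matching, assuming simplified stand-in types: sketch_id, SKETCH_ANY_ID, sketch_match_one_id and sketch_match_id are hypothetical names that only mirror the shape of the helpers in this patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ANY_ID (~0u) /* wildcard, playing the role of PCI_ANY_ID */

/* Hypothetical stand-in for struct pci_device_id (matching fields only). */
struct sketch_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class_code, class_mask;
};

/* Return nonzero if table entry @id accepts the concrete ID @dev_id. */
int sketch_match_one_id(const struct sketch_id *id,
			const struct sketch_id *dev_id)
{
	return (id->vendor == SKETCH_ANY_ID || id->vendor == dev_id->vendor) &&
	       (id->device == SKETCH_ANY_ID || id->device == dev_id->device) &&
	       (id->subvendor == SKETCH_ANY_ID ||
		id->subvendor == dev_id->subvendor) &&
	       (id->subdevice == SKETCH_ANY_ID ||
		id->subdevice == dev_id->subdevice) &&
	       !((id->class_code ^ dev_id->class_code) & id->class_mask);
}

/* Walk a table terminated by an all-zero entry; NULL if nothing matches. */
const struct sketch_id *sketch_match_id(const struct sketch_id *ids,
					const struct sketch_id *dev_id)
{
	if (!ids)
		return NULL;
	for (; ids->vendor || ids->subvendor || ids->class_mask; ids++) {
		if (sketch_match_one_id(ids, dev_id))
			return ids;
	}
	return NULL;
}
```

Taking the concrete ID as a plain struct means a caller that only has numbers (such as a sysfs write) can match without building a fake device first.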
Signed-off-by: Gary Guo <gary@garyguo.net>
---
drivers/pci/pci-driver.c | 38 ++++++++++++++++++++++++++++----------
drivers/pci/pci.h | 36 ++++++++++++++++++++++++++----------
drivers/pci/search.c | 6 ++++--
3 files changed, 58 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f36778e62ac1..0507cb801310 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -90,6 +90,27 @@ static void pci_free_dynids(struct pci_driver *drv)
spin_unlock(&drv->dynids.lock);
}
+/**
+ * do_pci_match_id - See if a PCI ID matches a given pci_id table
+ * @ids: array of PCI device ID structures to search in
+ * @dev_id: the actual PCI device ID structure to match against.
+ *
+ * Returns the matching pci_device_id structure or
+ * %NULL if there is no match.
+ */
+static const struct pci_device_id *do_pci_match_id(const struct pci_device_id *ids,
+ const struct pci_device_id *dev_id)
+{
+ if (ids) {
+ while (ids->vendor || ids->subvendor || ids->class_mask) {
+ if (pci_match_one_id(ids, dev_id))
+ return ids;
+ ids++;
+ }
+ }
+ return NULL;
+}
+
/**
* pci_match_id - See if a PCI device matches a given pci_id table
* @ids: array of PCI device ID structures to search in
@@ -105,14 +126,9 @@ static void pci_free_dynids(struct pci_driver *drv)
const struct pci_device_id *pci_match_id(const struct pci_device_id *ids,
struct pci_dev *dev)
{
- if (ids) {
- while (ids->vendor || ids->subvendor || ids->class_mask) {
- if (pci_match_one_device(ids, dev))
- return ids;
- ids++;
- }
- }
- return NULL;
+ struct pci_device_id dev_id = pci_id_from_device(dev);
+
+ return do_pci_match_id(ids, &dev_id);
}
EXPORT_SYMBOL(pci_match_id);
@@ -138,6 +154,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
{
struct pci_dynid *dynid;
const struct pci_device_id *found_id = NULL, *ids;
+ struct pci_device_id dev_id;
int ret;
/* When driver_override is set, only bind to the matching driver */
@@ -145,10 +162,11 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
if (ret == 0)
return NULL;
+ dev_id = pci_id_from_device(dev);
/* Look at the dynamic ids first, before the static ones */
spin_lock(&drv->dynids.lock);
list_for_each_entry(dynid, &drv->dynids.list, node) {
- if (pci_match_one_device(&dynid->id, dev)) {
+ if (pci_match_one_id(&dynid->id, &dev_id)) {
found_id = &dynid->id;
break;
}
@@ -158,7 +176,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
if (found_id)
return found_id;
- for (ids = drv->id_table; (found_id = pci_match_id(ids, dev));
+ for (ids = drv->id_table; (found_id = do_pci_match_id(ids, &dev_id));
ids = found_id + 1) {
/*
* The match table is split based on driver_override.
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4469e1a77f3c..0567a8762baa 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -442,21 +442,37 @@ static inline int pci_setup_cardbus(char *str) { return -ENOENT; }
#endif /* CONFIG_CARDBUS */
/**
- * pci_match_one_device - Tell if a PCI device structure has a matching
- * PCI device id structure
- * @id: single PCI device id structure to match
- * @dev: the PCI device structure to match against
+ * pci_id_from_device - Obtain a pci_device_id from a PCI device
+ * @dev: the PCI device
+ *
+ * Returns a pci_device_id filled.
+ */
+static inline struct pci_device_id pci_id_from_device(const struct pci_dev *dev)
+{
+ return (struct pci_device_id) {
+ .vendor = dev->vendor,
+ .device = dev->device,
+ .subvendor = dev->subsystem_vendor,
+ .subdevice = dev->subsystem_device,
+ .class = dev->class,
+ };
+}
+
+/**
+ * pci_match_one_id - Tell if a PCI device ID matches a needle PCI device id
+ * @id: single PCI device id structure to match against (needle)
+ * @dev_id: the actual ID from the PCI device (can be created via pci_id_from_device)
*
* Returns the matching pci_device_id structure or %NULL if there is no match.
*/
static inline const struct pci_device_id *
-pci_match_one_device(const struct pci_device_id *id, const struct pci_dev *dev)
+pci_match_one_id(const struct pci_device_id *id, const struct pci_device_id *dev_id)
{
- if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
- (id->device == PCI_ANY_ID || id->device == dev->device) &&
- (id->subvendor == PCI_ANY_ID || id->subvendor == dev->subsystem_vendor) &&
- (id->subdevice == PCI_ANY_ID || id->subdevice == dev->subsystem_device) &&
- !((id->class ^ dev->class) & id->class_mask))
+ if ((id->vendor == PCI_ANY_ID || id->vendor == dev_id->vendor) &&
+ (id->device == PCI_ANY_ID || id->device == dev_id->device) &&
+ (id->subvendor == PCI_ANY_ID || id->subvendor == dev_id->subvendor) &&
+ (id->subdevice == PCI_ANY_ID || id->subdevice == dev_id->subdevice) &&
+ !((id->class ^ dev_id->class) & id->class_mask))
return id;
return NULL;
}
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index e3d3177fce54..c8c4bfe7817b 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -245,8 +245,10 @@ static int match_pci_dev_by_id(struct device *dev, const void *data)
{
struct pci_dev *pdev = to_pci_dev(dev);
const struct pci_device_id *id = data;
+ struct pci_device_id dev_id;
- if (pci_match_one_device(id, pdev))
+ dev_id = pci_id_from_device(pdev);
+ if (pci_match_one_id(id, &dev_id))
return 1;
return 0;
}
@@ -418,7 +420,7 @@ EXPORT_SYMBOL(pci_get_class);
*
* Iterates through the list of known PCI devices. If a PCI device is found
* with a matching base class code, the reference count to the device is
- * incremented. See pci_match_one_device() to figure out how does this works.
+ * incremented. See pci_match_one_id() to figure out how does this works.
* A new search is initiated by passing %NULL as the @from argument.
* Otherwise if @from is not %NULL, searches continue from next device on the
* global list. The reference count for @from is always decremented if it is
--
2.54.0