* Re: [PATCH v6 01/10] dt-bindings: display: rockchip: analogix-dp: Fix hclk as third clock for RK3588
From: Krzysztof Kozlowski @ 2026-05-22 6:11 UTC (permalink / raw)
To: Damon Ding
Cc: hjc, heiko, andy.yan, maarten.lankhorst, mripard, tzimmermann,
airlied, simona, robh, krzk+dt, conor+dt, andrzej.hajda,
neil.armstrong, rfoss, Laurent.pinchart, jonas, jernej.skrabec,
nicolas.frattaroli, cristian.ciocaltea, sebastian.reichel,
dmitry.baryshkov, luca.ceresoli, dianders, m.szyprowski,
dri-devel, devicetree, linux-arm-kernel, linux-rockchip,
linux-kernel
In-Reply-To: <20260521080835.1362416-2-damon.ding@rock-chips.com>
On Thu, May 21, 2026 at 04:08:26PM +0800, Damon Ding wrote:
> RK3588 eDP controller requires HCLK_VO1 to access the VO1 GRF
> registers and enable the video datapath.
>
> Previously, the clock was enabled implicitly via the 'rockchip,vo-grf'
> phandle reference, which allowed the eDP to work without explicitly
> managing the hclk_vo1 clock. However, this is not safe or explicit.
>
> To make the clock dependency explicit, enforce per-SoC clock-names
> requirements:
> - RK3288: 2 clocks (dp, pclk)
> - RK3399: 3 clocks (dp, pclk, grf)
> - RK3588: 3 clocks (dp, pclk, hclk)
>
> Do not reuse the 'grf' clock name for RK3588 because it represents
> a different clock with distinct control logic:
> - The 'grf' clock is only for GRF register access and is toggled
> dynamically during register access.
> - The 'hclk' clock controls both GRF access and video datapath
> gating, and must remain enabled during probe.
>
> Fixes: f855146263b1 ("dt-bindings: display: rockchip: analogix-dp: Add support for RK3588")
> Signed-off-by: Damon Ding <damon.ding@rock-chips.com>
>
> ---
>
> Changes in v4:
> - Modify the commit msg.
>
> Changes in v5:
> - Enforce the correct third clock name on a per-compatible basis.
> - Modify the commit msg simultaneously.
>
> Changes in v6:
> - Expand more detail commit msg about using hclk instead of grf clock.
> ---
> .../rockchip/rockchip,analogix-dp.yaml | 37 +++++++++++++++++--
> 1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml
> index d99b23b88cc5..8001c1facf98 100644
> --- a/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml
> +++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml
> @@ -23,10 +23,7 @@ properties:
>
> clock-names:
> minItems: 2
> - items:
> - - const: dp
> - - const: pclk
> - - const: grf
> + maxItems: 3
>
> power-domains:
> maxItems: 1
> @@ -60,6 +57,33 @@ required:
> allOf:
> - $ref: /schemas/display/bridge/analogix,dp.yaml#
>
> + - if:
> + properties:
> + compatible:
> + contains:
> + enum:
> + - rockchip,rk3288-dp
> + then:
> + properties:
> + clock-names:
> + items:
> + - const: dp
> + - const: pclk
Why aren't there constraints for clocks? They always must come together.
Please open any other binding and look how it is done there.
Best regards,
Krzysztof
^ permalink raw reply
* [PATCH] irqchip/exynos-combiner: remove useless spinlock
From: Marek Szyprowski @ 2026-05-22 6:10 UTC (permalink / raw)
To: linux-arm-kernel, linux-samsung-soc, linux-rt-devel
Cc: Marek Szyprowski, Thomas Gleixner, Krzysztof Kozlowski,
Alim Akhtar, Sebastian Andrzej Siewior, Clark Williams,
Steven Rostedt
In-Reply-To: <CGME20260522061030eucas1p15f61b7bc01f77a3d34d41472210ea662@eucas1p1.samsung.com>
irq_controller_lock doesn't protect anything, it must be some leftover
from early development or copy/paste. Remove it completely.
Suggested-by: Thomas Gleixner <tglx@kernel.org>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/20260520220422.3522908-1-m.szyprowski@samsung.com/
Fixes: 96031b31a4b3 ("irqchip/exynos-combiner: Switch to raw_spinlock")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
drivers/irqchip/exynos-combiner.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/irqchip/exynos-combiner.c b/drivers/irqchip/exynos-combiner.c
index 03cafcc5c835..d9d408cb4711 100644
--- a/drivers/irqchip/exynos-combiner.c
+++ b/drivers/irqchip/exynos-combiner.c
@@ -24,8 +24,6 @@
#define IRQ_IN_COMBINER 8
-static DEFINE_RAW_SPINLOCK(irq_controller_lock);
-
struct combiner_chip_data {
unsigned int hwirq_offset;
unsigned int irq_mask;
@@ -72,9 +70,7 @@ static void combiner_handle_cascade_irq(struct irq_desc *desc)
chained_irq_enter(chip, desc);
- raw_spin_lock(&irq_controller_lock);
status = readl_relaxed(chip_data->base + COMBINER_INT_STATUS);
- raw_spin_unlock(&irq_controller_lock);
status &= chip_data->irq_mask;
if (status == 0)
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v9 2/5] dt-bindings: phy: Add documentation for Airoha AN7581 USB PHY
From: Krzysztof Kozlowski @ 2026-05-22 6:06 UTC (permalink / raw)
To: Christian Marangi
Cc: Michael Turquette, Stephen Boyd, Brian Masney, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Vinod Koul, Neil Armstrong,
Lorenzo Bianconi, Felix Fietkau, linux-clk, devicetree,
linux-kernel, linux-arm-kernel, linux-phy
In-Reply-To: <20260521153645.7028-3-ansuelsmth@gmail.com>
On Thu, May 21, 2026 at 05:35:53PM +0200, Christian Marangi wrote:
> Add documentation for Airoha AN7581 USB PHY that describe the USB PHY
> for the USB controller.
>
> Airoha AN7581 SoC support a maximum of 2 USB port. The USB 2.0 mode is
> always supported. The USB 3.0 mode is optional and depends on the Serdes
> mode currently configured on the system for the relevant USB port.
>
> To correctly calibrate, the USB 2.0 port require correct value in
> "airoha,usb2-monitor-clk-sel" property. Both the 2 USB 2.0 port permit
> selecting one of the 4 monitor clock for calibration (internal clock not
> exposed to the system) but each port have only one of the 4 actually
> connected in HW hence the correct value needs to be specified in DT
> based on board and the physical port. Normally it's monitor clock 1 for
> USB1 and monitor clock 2 for USB2.
>
> To correctly setup the Serdes mode attached to the USB 3.0 mode, a phys
> property is required with the phandle pointing to the correct Serdes port
> provided by the SCU node. Providing the phys property is optional if USB
> 3.0 is not used.
>
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
> ---
> .../bindings/phy/airoha,an7581-usb-phy.yaml | 62 +++++++++++++++++++
> MAINTAINERS | 6 ++
> 2 files changed, 68 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/phy/airoha,an7581-usb-phy.yaml
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Best regards,
Krzysztof
^ permalink raw reply
* RE: [PATCH 1/2] PCI: host-generic: Simplify return value handling in pci_host_common_parse_port(s)
From: Hongxing Zhu @ 2026-05-22 6:05 UTC (permalink / raw)
To: Sherry Sun (OSS), l.stach@pengutronix.de, Frank Li,
bhelgaas@google.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
s.hauer@pengutronix.de, kernel@pengutronix.de, festevam@gmail.com,
will@kernel.org
Cc: imx@lists.linux.dev, linux-pci@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, Sherry Sun
In-Reply-To: <20260522034344.1147775-2-sherry.sun@oss.nxp.com>
> -----Original Message-----
> From: Sherry Sun (OSS) <sherry.sun@oss.nxp.com>
> Sent: Friday, May 22, 2026 11:44 AM
> To: Hongxing Zhu <hongxing.zhu@nxp.com>; l.stach@pengutronix.de; Frank Li
> <frank.li@nxp.com>; bhelgaas@google.com; lpieralisi@kernel.org;
> kwilczynski@kernel.org; mani@kernel.org; robh@kernel.org;
> s.hauer@pengutronix.de; kernel@pengutronix.de; festevam@gmail.com;
> will@kernel.org
> Cc: imx@lists.linux.dev; linux-pci@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; linux-kernel@vger.kernel.org; Sherry Sun
> <sherry.sun@nxp.com>
> Subject: [PATCH 1/2] PCI: host-generic: Simplify return value handling in
> pci_host_common_parse_port(s)
>
> From: Sherry Sun <sherry.sun@nxp.com>
>
> The pci_host_common_parse_port() shouldn't check the RC-level binding.
> That's a policy decision that belongs to the caller, not this common helper.
>
> Simplify pci_host_common_parse_port() to only parses properties from the Root
"to only parses"/s/"to only parse"
> Port(and its children) without checking the RC node. Also change
Missing space after "Port".
> pci_host_common_parse_ports() to return 0 when no ports are found, since it is
> not an error.
>
> So now both functions won't return failure for "property not found" or "port not
> found", they purely returns 0 on success and a negative error code on real
returns/s/return
Best Regards
Richard Zhu
> failures.
>
> Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
> ---
> drivers/pci/controller/pci-host-common.c | 29 ++++--------------------
> 1 file changed, 5 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-host-common.c b/drivers/pci/controller/pci-
> host-common.c
> index 67455caaf9e6..7ce5a939e3d9 100644
> --- a/drivers/pci/controller/pci-host-common.c
> +++ b/drivers/pci/controller/pci-host-common.c
> @@ -108,8 +108,7 @@ static int pci_host_common_parse_perst(struct device
> *dev,
> * dependencies and the driver may fail to operate if required resources
> * are missing.
> *
> - * Return: 0 on success, -ENODEV if PERST# found in RC node (legacy binding
> - * should be used), Other negative error codes on failure.
> + * Return: 0 on success, negative error codes on failure.
> */
> static int pci_host_common_parse_port(struct device *dev,
> struct pci_host_bridge *bridge, @@ -128,22
> +127,6 @@ static int pci_host_common_parse_port(struct device *dev,
> if (ret)
> return ret;
>
> - /*
> - * 1. PERST# found in RP or its child nodes - list is not empty,
> - * continue
> - *
> - * 2. PERST# not found in RP/children, but found in RC node -
> - * return -ENODEV to fallback legacy binding
> - *
> - * 3. PERST# not found anywhere - list is empty, continue (optional
> - * PERST#)
> - */
> - if (list_empty(&port->perst)) {
> - if (of_property_present(dev->of_node, "reset-gpios") ||
> - of_property_present(dev->of_node, "reset-gpio"))
> - return -ENODEV;
> - }
> -
> INIT_LIST_HEAD(&port->list);
> list_add_tail(&port->list, &bridge->ports);
>
> @@ -158,13 +141,11 @@ static int pci_host_common_parse_port(struct device
> *dev,
> * Iterate through child nodes of the host bridge and parse Root Port
> * properties (currently only reset GPIOs).
> *
> - * Return: 0 on success, -ENODEV if no ports found or PERST# found in RC
> - * node (legacy binding should be used), Other negative error codes on
> - * failure.
> + * Return: 0 on success, negative error codes on failure.
> */
> int pci_host_common_parse_ports(struct device *dev, struct pci_host_bridge
> *bridge) {
> - int ret = -ENODEV;
> + int ret = 0;
>
> for_each_available_child_of_node_scoped(dev->of_node, of_port) {
> if (!of_node_is_type(of_port, "pci")) @@ -174,8 +155,8 @@ int
> pci_host_common_parse_ports(struct device *dev, struct pci_host_bridge *brid
> goto err_cleanup;
> }
>
> - if (ret)
> - return ret;
> + if (list_empty(&bridge->ports))
> + return 0;
>
> return devm_add_action_or_reset(dev, pci_host_common_delete_ports,
> &bridge->ports);
> --
> 2.37.1
^ permalink raw reply
* RE: [PATCH 2/2] PCI: imx6: Add imx_pcie_perst_found() to inspect the parsed result
From: Hongxing Zhu @ 2026-05-22 6:05 UTC (permalink / raw)
To: Sherry Sun (OSS), l.stach@pengutronix.de, Frank Li,
bhelgaas@google.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
s.hauer@pengutronix.de, kernel@pengutronix.de, festevam@gmail.com,
will@kernel.org
Cc: imx@lists.linux.dev, linux-pci@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, Sherry Sun
In-Reply-To: <20260522034344.1147775-3-sherry.sun@oss.nxp.com>
> -----Original Message-----
> From: Sherry Sun (OSS) <sherry.sun@oss.nxp.com>
> Sent: Friday, May 22, 2026 11:44 AM
> To: Hongxing Zhu <hongxing.zhu@nxp.com>; l.stach@pengutronix.de; Frank Li
> <frank.li@nxp.com>; bhelgaas@google.com; lpieralisi@kernel.org;
> kwilczynski@kernel.org; mani@kernel.org; robh@kernel.org;
> s.hauer@pengutronix.de; kernel@pengutronix.de; festevam@gmail.com;
> will@kernel.org
> Cc: imx@lists.linux.dev; linux-pci@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; linux-kernel@vger.kernel.org; Sherry Sun
> <sherry.sun@nxp.com>
> Subject: [PATCH 2/2] PCI: imx6: Add imx_pcie_perst_found() to inspect the
> parsed result
>
> From: Sherry Sun <sherry.sun@nxp.com>
>
> Since pci_host_common_parse_port() doesn't return failure for "property not
> found"(-ENODEV), the caller should inspect the parsed result and decide whether
One space should be placed before (-ENODEV).
> to fall back to the legacy binding.
> Add imx_pcie_perst_found() to inspect the parsed result.
>
> Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Reviewed-by: Richard Zhu <hongxing.zhu@nxp.com>
Best Regards
Richard Zhu
> ---
> drivers/pci/controller/dwc/pci-imx6.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pci-imx6.c
> b/drivers/pci/controller/dwc/pci-imx6.c
> index b137551871fc..34756f28fcc6 100644
> --- a/drivers/pci/controller/dwc/pci-imx6.c
> +++ b/drivers/pci/controller/dwc/pci-imx6.c
> @@ -1287,6 +1287,18 @@ static void imx_pcie_assert_perst(struct imx_pcie
> *imx_pcie, bool assert)
> }
> }
>
> +static bool imx_pcie_perst_found(struct pci_host_bridge *bridge) {
> + struct pci_host_port *port;
> +
> + list_for_each_entry(port, &bridge->ports, list) {
> + if (!list_empty(&port->perst))
> + return true;
> + }
> +
> + return false;
> +}
> +
> static int imx_pcie_host_init(struct dw_pcie_rp *pp) {
> struct dw_pcie *pci = to_dw_pcie_from_pp(pp); @@ -1299,15 +1311,12
> @@ static int imx_pcie_host_init(struct dw_pcie_rp *pp)
> /* Parse Root Port nodes if present */
> ret = pci_host_common_parse_ports(dev, bridge);
> if (ret) {
> - if (ret != -ENODEV) {
> - dev_err(dev, "Failed to parse Root Port
> nodes: %d\n", ret);
> - return ret;
> - }
> + dev_err(dev, "Failed to parse Root Port nodes: %d\n",
> ret);
> + return ret;
> + }
>
> - /*
> - * Fall back to legacy binding for DT backwards
> - * compatibility
> - */
> + /* Fallback to legacy binding for DT backwards compatibility. */
> + if (!imx_pcie_perst_found(bridge)) {
> ret = imx_pcie_parse_legacy_binding(imx_pcie);
> if (ret)
> return ret;
> --
> 2.37.1
^ permalink raw reply
* Re: [PATCH v10 19/30] KVM: arm64: Provide assembly for SME register access
From: Marc Zyngier @ 2026-05-22 5:52 UTC (permalink / raw)
To: Mark Rutland
Cc: Mark Brown, Oliver Upton, Joey Gouly, Catalin Marinas,
Suzuki K Poulose, Will Deacon, Paolo Bonzini, Jonathan Corbet,
Shuah Khan, Dave Martin, Fuad Tabba, Ben Horgan, linux-arm-kernel,
kvmarm, linux-kernel, kvm, linux-doc, linux-kselftest,
Peter Maydell, Eric Auger
In-Reply-To: <ag8b7oq4SFpdmlP_@J2N7QTR9R3>
On Thu, 21 May 2026 15:51:26 +0100,
Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Fri, Mar 06, 2026 at 05:01:11PM +0000, Mark Brown wrote:
> > Provide versions of the SME state save and restore functions for the
> > hypervisor to allow it to restore ZA and ZT for guests.
> >
> > Signed-off-by: Mark Brown <broonie@kernel.org>
> > ---
> > arch/arm64/include/asm/kvm_hyp.h | 2 ++
> > arch/arm64/kvm/hyp/fpsimd.S | 23 +++++++++++++++++++++++
> > 2 files changed, 25 insertions(+)
>
> While this specific instance is simple enough, I don't think we should
> continue to duplicate the low level save/restore routines between the
> main kernel and KVM hyp code.
>
> I've sent a series that avoids the need for this, and cleans up some
> other bits):
>
> https://lore.kernel.org/linux-arm-kernel/20260521132556.584676-1-mark.rutland@arm.com/
>
> Assuming Marc and Oliver are on board, I'd prefer that we do that
> cleanup first, and build the KVM SME support atop.
Absolutely. The whole FP/SVE is still way too complicated, full of
esoteric constructs, hard to audit, and I would really like to see it
cleaned-up before stacking another layer on top.
I've quickly eyeballed the KVM-specific patches yesterday, and nothing
seem outlandish, so there is a good chance some of that could make it
into 7.2. I plan to look at it again shortly.
Thanks,
M.
--
Jazz isn't dead. It just smells funny.
^ permalink raw reply
* Re: [PATCH] arm64: tlb: Flush walk cache when unsharing PMD tables
From: Zeng Heng @ 2026-05-22 5:32 UTC (permalink / raw)
To: Catalin Marinas
Cc: yezhenyu2, zhurui3, will, akpm, npiggin, aneesh.kumar, peterz,
linux-kernel, wangkefeng.wang, linux-arm-kernel, linux-mm,
linux-arch, David Hildenbrand, zengheng4
In-Reply-To: <ag8hmZ3kjtJINyIT@arm.com>
On 2026/5/21 23:15, Catalin Marinas wrote:
> On Thu, May 21, 2026 at 04:05:07PM +0100, Catalin Marinas wrote:
>> + David H.
>>
>> On Thu, May 21, 2026 at 03:30:11PM +0800, Zeng Heng wrote:
>>> From: Zeng Heng <zengheng4@huawei.com>
>>>
>>> When huge_pmd_unshare() is called to unshare a PMD table, the
>>> tlb_unshare_pmd_ptdesc() function sets tlb->unshared_tables=true
>>> but the aarch64 tlb_flush() only checked tlb->freed_tables to
>>> determine whether to use TLBF_NONE (vae1is, invalidates walk
>>> cache) or TLBF_NOWALKCACHE (vale1is, leaf-only).
>>>
>>> This caused the stale PMD page table entry to remain in the walk cache
>>> after unshare, potentially leading to incorrect page table walks.
>>>
>>> Fix by including unshared_tables in the check, so that when
>>> unsharing tables, TLBF_NONE is used and the walk cache is properly
>>> invalidated.
>>>
>>> Here is the detailed distinction between vae1is and vale1is:
>>>
>>> | Instruction Combination | Actual Invalidation Scope |
>>> | ------------------------ | --------------------------------------------------|
>>> | `VAE1IS` + TTL=`0` | All entries at all levels (full invalidation) |
>>> | `VAE1IS` + TTL=`2` (L2) | Non-leaf at Level 0/1 + leaf at Level 2 |
>>> | `VALE1IS` + TTL=`0` | Leaf entries at all levels (non-leaf not cleared) |
>>> | `VALE1IS` + TTL=`2` (L2) | Leaf entry at Level 2 only |
>>>
>>> Signed-off-by: Zeng Heng <zengheng4@huawei.com>
>> The fix looks fine but does it need:
>>
>> Fixes: 8ce720d5bd91 ("mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather")
>> Cc: <stable@vger.kernel.org>
>>
>>> ---
>>> arch/arm64/include/asm/tlb.h | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
>>> index 10869d7731b8..751bd57bc3ba 100644
>>> --- a/arch/arm64/include/asm/tlb.h
>>> +++ b/arch/arm64/include/asm/tlb.h
>>> @@ -53,7 +53,8 @@ static inline int tlb_get_level(struct mmu_gather *tlb)
>>> static inline void tlb_flush(struct mmu_gather *tlb)
>>> {
>>> struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
>>> - tlbf_t flags = tlb->freed_tables ? TLBF_NONE : TLBF_NOWALKCACHE;
>>> + tlbf_t flags = (tlb->freed_tables || tlb->unshared_tables) ?
>>> + TLBF_NONE : TLBF_NOWALKCACHE;
>>> unsigned long stride = tlb_get_unmap_size(tlb);
>>> int tlb_level = tlb_get_level(tlb);
> Do we need this as well?
The proposed fix has been validated against the issue scenarios and
works as expected.
Per the ARM Architecture Reference Manual, whether only the last-level
page table entry is invalidated is determined by the instruction used
(vale1is for leaf entry only, vae1is for walk cache including leaf entry
and
non-leaf entry), rather than the TTL field. The TTL field merely specifies
which level the leaf entry belongs to.
Setting TTL to 0 always works fine, however, hardware must assume that
the entry can be from any level.[1][2]
[1]: vae1is instruction introduction by ARM spec,
https://developer.arm.com/documentation/ddi0601/2026-03/AArch64-Instructions/TLBI-VAE1IS--TLBI-VAE1ISNXS--TLB-Invalidate-by-VA--EL1--Inner-Shareable
[2]: rvae1is instruction introduction by ARM spec,
https://developer.arm.com/documentation/ddi0601/2026-03/AArch64-Instructions/TLBI-RVAE1IS--TLBI-RVAE1ISNXS--TLB-Range-Invalidate-by-VA--EL1--Inner-Shareable?lang=en
Best regards,
Zeng Heng
>
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index 10869d7731b8..3f4ab38cfd6e 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -24,7 +24,7 @@ static void tlb_flush(struct mmu_gather *tlb);
> static inline int tlb_get_level(struct mmu_gather *tlb)
> {
> /* The TTL field is only valid for the leaf entry. */
> - if (tlb->freed_tables)
> + if (tlb->freed_tables || tlb->unshared_tables)
> return TLBI_TTL_UNKNOWN;
>
> if (tlb->cleared_ptes && !(tlb->cleared_pmds ||
^ permalink raw reply
* [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
Try to align the vmap virtual address to PMD_SHIFT or a
larger PTE mapping size hinted by the architecture, so
contiguous pages can be batch-mapped when setting PMD or
PTE entries.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
mm/vmalloc.c | 33 ++++++++++++++++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 50642246f4d40..040d400928aab 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3620,6 +3620,37 @@ static int vmap_batched(unsigned long addr, unsigned long end,
return err;
}
+static struct vm_struct *get_aligned_vm_area(unsigned long size,
+ unsigned long flags, const void *caller)
+{
+ struct vm_struct *vm_area;
+ unsigned int shift;
+
+ /* Try PMD alignment for large sizes */
+ if (size >= PMD_SIZE) {
+ vm_area = __get_vm_area_node(size, PMD_SIZE, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ NUMA_NO_NODE, GFP_KERNEL, caller);
+ if (vm_area)
+ return vm_area;
+ }
+
+ /* Try CONT_PTE alignment */
+ shift = arch_vmap_pte_supported_shift(size);
+ if (shift > PAGE_SHIFT) {
+ vm_area = __get_vm_area_node(size, 1UL << shift, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ NUMA_NO_NODE, GFP_KERNEL, caller);
+ if (vm_area)
+ return vm_area;
+ }
+
+ /* Fall back to page alignment */
+ return __get_vm_area_node(size, PAGE_SIZE, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ NUMA_NO_NODE, GFP_KERNEL, caller);
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3658,7 +3689,7 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
size = (unsigned long)count << PAGE_SHIFT;
- area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ area = get_aligned_vm_area(size, flags, __builtin_return_address(0));
if (!area)
return NULL;
--
2.34.1
^ permalink raw reply related
* [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
In many cases, the pages passed to vmap() may include high-order
pages. For example, the systemheap often allocates pages in descending
order: order 8, then 4, then 0. Currently, vmap() iterates over every
page individually—even pages inside a high-order block are handled
one by one.
This patch detects physically contiguous pages (regardless of whether
they are compound or non-compound) by scanning with
num_pages_contiguous(), and maps them as a single contiguous block
whenever possible. The first page's pfn must be aligned to the
mapping order for the batched mapping to be used.
Pages with the same page_shift are coalesced and mapped via
vmap_pages_range_noflush_walk() to avoid page table rewalk.
As users typically allocate memory in descending orders (e.g.
8 → 4 → 0), once an order-0 page is encountered, we stop scanning
for contiguous pages since subsequent pages are likely order-0 as well.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 80 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index deb764abc0571..50642246f4d40 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);
+static inline int get_vmap_batch_order(struct page **pages,
+ unsigned int max_steps, unsigned int idx)
+{
+ unsigned int nr_contig;
+ int order;
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
+ ioremap_max_page_shift == PAGE_SHIFT)
+ return 0;
+
+ nr_contig = num_pages_contiguous(&pages[idx], max_steps);
+ if (nr_contig < 2)
+ return 0;
+
+ order = fls(nr_contig) - 1;
+
+ if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
+ return 0;
+
+ /* Ensure the first page's pfn is aligned to the order */
+ if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
+ return 0;
+
+ return order;
+}
+
+static int vmap_batched(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
+{
+ unsigned int count = (end - addr) >> PAGE_SHIFT;
+ unsigned int prev_shift = 0, idx = 0;
+ unsigned long start = addr, map_addr = addr;
+ int err;
+
+ err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
+ PAGE_SHIFT, GFP_KERNEL);
+ if (err)
+ goto out;
+
+ for (unsigned int i = 0; i < count; ) {
+ unsigned int shift = PAGE_SHIFT +
+ get_vmap_batch_order(pages, count - i, i);
+
+ if (!i)
+ prev_shift = shift;
+
+ if (shift != prev_shift) {
+ err = vmap_pages_range_noflush_walk(map_addr, addr,
+ prot, pages + idx,
+ min(prev_shift, PMD_SHIFT));
+ if (err)
+ goto out;
+ prev_shift = shift;
+ map_addr = addr;
+ idx = i;
+ }
+
+ /*
+ * Once small pages are encountered, the remaining pages
+ * are likely small as well.
+ */
+ if (shift == PAGE_SHIFT)
+ break;
+
+ addr += 1UL << shift;
+ i += 1U << (shift - PAGE_SHIFT);
+ }
+
+ /* Remaining */
+ if (map_addr < end)
+ err = vmap_pages_range_noflush_walk(map_addr, end,
+ prot, pages + idx, min(prev_shift, PMD_SHIFT));
+
+out:
+ flush_cache_vmap(start, end);
+ return err;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
addr = (unsigned long)area->addr;
- if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
+ if (vmap_batched(addr, addr + size, pgprot_nx(prot),
+ pages) < 0) {
vunmap(area->addr);
return NULL;
}
--
2.34.1
^ permalink raw reply related
* [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
provides a clean interface by taking struct page **pages and mapping them
via direct PTE iteration. This avoids the page table rewalk seen when
using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
Extend it to support larger page_shift values, and add PMD- and
contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
since it now handles more than just small pages.
For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
iterate over pages one by one via vmap_range_noflush(), which would
otherwise lead to page table rewalk. The code is now unified with the
PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
1 file changed, 40 insertions(+), 31 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 53fd4ee460ea4..deb764abc0571 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
+ unsigned long pfn, size;
+ unsigned int steps;
int err = 0;
pte_t *pte;
@@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
break;
}
- set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
- (*nr)++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ pfn = page_to_pfn(page);
+ size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
+ steps = PFN_DOWN(size);
+ } while (pte += steps, *nr += steps, addr += size, addr != end);
lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
@@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pmd_t *pmd;
unsigned long next;
@@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
+
+ if (shift == PMD_SHIFT) {
+ struct page *page = pages[*nr];
+ phys_addr_t phys_addr;
+
+ if (WARN_ON(!page))
+ return -ENOMEM;
+ if (WARN_ON(!pfn_valid(page_to_pfn(page))))
+ return -EINVAL;
+
+ phys_addr = page_to_phys(page);
+
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
+ shift)) {
+ *mask |= PGTBL_PMD_MODIFIED;
+ *nr += 1 << (shift - PAGE_SHIFT);
+ continue;
+ }
+ }
+
+ if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
@@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pud_t *pud;
unsigned long next;
@@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
@@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
p4d_t *p4d;
unsigned long next;
@@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
}
-static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages)
+static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int shift)
{
unsigned long start = addr;
pgd_t *pgd;
@@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
- err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
- unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
WARN_ON(page_shift < PAGE_SHIFT);
- if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages);
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+ page_shift = PAGE_SHIFT;
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
-
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
- page_to_phys(pages[i]), prot,
- page_shift);
- if (err)
- return err;
-
- addr += 1UL << page_shift;
- }
-
- return 0;
+ return vmap_pages_range_noflush_walk(addr, end, prot, pages,
+ min(page_shift, PMD_SHIFT));
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
--
2.34.1
^ permalink raw reply related
* [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
Extract the common PTE mapping logic from vmap_pte_range() into a
shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
PTE mappings in a single function, preparing for the next patch which
will extend vmap_pages_pte_range() to also use this helper.
The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
so callers no longer need to handle the conditional compilation.
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
mm/vmalloc.c | 49 ++++++++++++++++++++++++++++++++++---------------
1 file changed, 34 insertions(+), 15 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2c2f74a07f396..53fd4ee460ea4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -91,6 +91,35 @@ struct vfree_deferred {
static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
/*** Page table manipulation functions ***/
+
+/*
+ * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
+ * supported, otherwise fall back to PAGE_SIZE mappings.
+ *
+ * Return: mapping size.
+ */
+static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
+ unsigned long addr, unsigned long end, u64 pfn,
+ pgprot_t prot, unsigned int max_page_shift)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+ if (max_page_shift > PAGE_SHIFT) {
+ unsigned long size;
+
+ size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
+ if (size != PAGE_SIZE) {
+ pte_t entry = pfn_pte(pfn, prot);
+
+ entry = arch_make_huge_pte(entry, ilog2(size), 0);
+ set_huge_pte_at(&init_mm, addr, pte, entry, size);
+ return size;
+ }
+ }
+#endif
+ set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+ return PAGE_SIZE;
+}
+
static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift, pgtbl_mod_mask *mask)
@@ -98,7 +127,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
u64 pfn;
struct page *page;
- unsigned long size = PAGE_SIZE;
+ unsigned long size;
+ unsigned int steps;
if (WARN_ON_ONCE(!PAGE_ALIGNED(end - addr)))
return -EINVAL;
@@ -119,20 +149,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
BUG();
}
-#ifdef CONFIG_HUGETLB_PAGE
- size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
- if (size != PAGE_SIZE) {
- pte_t entry = pfn_pte(pfn, prot);
-
- entry = arch_make_huge_pte(entry, ilog2(size), 0);
- set_huge_pte_at(&init_mm, addr, pte, entry, size);
- pfn += PFN_DOWN(size);
- continue;
- }
-#endif
- set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
- pfn++;
- } while (pte += PFN_DOWN(size), addr += size, addr != end);
+ size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
+ steps = PFN_DOWN(size);
+ } while (pte += steps, pfn += steps, addr += size, addr != end);
lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
--
2.34.1
^ permalink raw reply related
* [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE hugepages,
reducing both PTE setup and TLB flush iterations.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
arch/arm64/include/asm/vmalloc.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b34..787fd17b48e2c 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
unsigned long end, u64 pfn,
unsigned int max_page_shift)
{
+ unsigned long size;
+
/*
* If the block is at least CONT_PTE_SIZE in size, and is naturally
* aligned in both virtual and physical space, then we can pte-map the
@@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
return PAGE_SIZE;
- return CONT_PTE_SIZE;
+ size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
+ size = 1UL << __fls(size);
+ return size;
}
#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
--
2.34.1
^ permalink raw reply related
* [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
In-Reply-To: <20260522053146.83209-1-jiangwenxiaomi@gmail.com>
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
we can batch CONT_PTE settings instead of handling them individually.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf56408..c4d8b226126cb 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
contig_ptes = CONT_PTES;
break;
default:
+ if (size > 0 && size < PMD_SIZE &&
+ IS_ALIGNED(size, CONT_PTE_SIZE)) {
+ contig_ptes = size >> PAGE_SHIFT;
+ *pgsize = PAGE_SIZE;
+ break;
+ }
WARN_ON(!__hugetlb_valid_size(size));
}
@@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
case CONT_PTE_SIZE:
return pte_mkcont(entry);
default:
+ if (pagesize > 0 && pagesize < PMD_SIZE &&
+ IS_ALIGNED(pagesize, CONT_PTE_SIZE))
+ return pte_mkcont(entry);
+
break;
}
pr_warn("%s: unrecognized huge page size 0x%lx\n",
--
2.34.1
^ permalink raw reply related
* [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Wen Jiang @ 2026-05-22 5:31 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
anshuman.khandual, ajd, linux-kernel, jiangwen6
From: jiangwen6 <jiangwen6@xiaomi.com>
This patchset accelerates ioremap, vmalloc, and vmap when the memory
is physically fully or partially contiguous. Two techniques are used:
1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
segments
2. Use batched mappings wherever possible in both vmalloc and ARM64
layers
Besides accelerating the mapping path, this also enables large
mappings (PMD and cont-PTE) for vmap, which are currently not
supported.
Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
CONT-PTE regions instead of just one.
Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
mapping logic between the ioremap and vmalloc/vmap paths, handling both
CONT_PTE and regular PTE mappings. This prepares for the next patch.
Patch 4 extends the page table walk path to support page shifts other
than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
mappings. The function is renamed from vmap_small_pages_range_noflush()
to vmap_pages_range_noflush_walk().
Patches 5-6 add huge vmap support for contiguous pages, including
support for non-compound pages with pfn alignment verification.
On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
the performance CPUfreq policy enabled, benchmark results:
* ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
* vmalloc(1 MB) mapping time (excluding allocation) with
VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53us)
* vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.
Changes since v2:
- Use __fls instead of fls in arch_vmap_pte_range_map_size (patch 2)
- Add WARN_ON checks in vmap_pages_pmd_range (patch 4)
- Fix flush_cache_vmap to use saved start address instead of the
already-advanced addr (patch 5)
- Rename __vmap_huge() to vmap_batched() (patch 5)
- Add caller parameter and unroll while(1) loop (patch 5)
- Squash patch 7 into patch 5 (stop scanning for compound pages after
encountering small pages)
Changes since v1:
- Fix condition order and use PMD_SIZE instead of CONT_PMD_SIZE in
patch 1 (Dev Jain)
- Squash patch 3+4 and patch 5+7 (Dev Jain)
- Replace "zigzag" with "page table rewalk" in commit messages
(Dev Jain)
- Rename vmap_small_pages_range_noflush() to
vmap_pages_range_noflush_walk() (Dev Jain)
- Extract vmap_set_ptes() as a new patch to consolidate PTE mapping
logic between vmap_pte_range() and vmap_pages_pte_range(), handling
both CONT_PTE and regular mappings (Mike Rapoport)
- Support non-compound pages in get_vmap_batch_order() by falling
back to physical contiguity scanning with pfn alignment check
(Dev Jain, Uladzislau Rezki)
- In get_vmap_batch_order(), filter out orders that the architecture
cannot batch by checking arch_vmap_pte_supported_shift() directly.
This avoids overhead for orders 1-3 on ARM64 CONT_PTE with 4K
pages. (patch 5)
Barry Song (Xiaomi) (5):
arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
setup
arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
CONT_PTE
mm/vmalloc: Extend page table walk to support larger page_shift sizes
and eliminate page table rewalk
mm/vmalloc: map contiguous pages in batches for vmap() if possible
mm/vmalloc: align vm_area so vmap() can batch mappings
Wen Jiang (1):
mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
arch/arm64/include/asm/vmalloc.h | 6 +-
arch/arm64/mm/hugetlbpage.c | 10 ++
mm/vmalloc.c | 235 ++++++++++++++++++++++++-------
3 files changed, 201 insertions(+), 50 deletions(-)
--
2.34.1
^ permalink raw reply
* [PATCH v4 4/4] Documentation: PCI: Add documentation for DOE endpoint support
From: Aksh Garg @ 2026-05-22 5:24 UTC (permalink / raw)
To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
skhan, lukas, cassel, alistair
Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk,
a-garg7
In-Reply-To: <20260522052434.802034-1-a-garg7@ti.com>
Document the architecture and implementation details for the Data Object
Exchange (DOE) framework for PCIe Endpoint devices.
Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---
Changes from v3 to v4:
- Updated the maximum size of the DOE object from 256KB to 1MB,
as per PCIe spec.
- Updated the DOE setup and cleanup sections.
Changes from v2 to v3:
- Rebased on 7.1-rc1.
Changes since v1:
- Squashed the patches [1] and [2], and moved the documentation file
to Documentation/PCI/endpoint/pci-endpoint-doe.rst to match the existing
naming scheme, as suggested by Niklas Cassel
- Updated the documentation as per the design and implementaion changes
made to previous patches in this series:
* Updated for static protocol array instead of dynamic registration
* Documented asynchronous callback model
* Updated request/response flow with new callback signature
* Updated memory ownership: DOE core frees request, driver frees response
* Updated initialization and cleanup sections for new APIs
v3: https://lore.kernel.org/all/20260427051725.223704-5-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-5-a-garg7@ti.com/
v1: [1] https://lore.kernel.org/all/20260213123603.420941-2-a-garg7@ti.com/
[2] https://lore.kernel.org/all/20260213123603.420941-5-a-garg7@ti.com/
Documentation/PCI/endpoint/index.rst | 1 +
.../PCI/endpoint/pci-endpoint-doe.rst | 329 ++++++++++++++++++
2 files changed, 330 insertions(+)
create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index dd1f62e731c9..7c03d5abd2ef 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -9,6 +9,7 @@ PCI Endpoint Framework
pci-endpoint
pci-endpoint-cfs
+ pci-endpoint-doe
pci-test-function
pci-test-howto
pci-ntb-function
diff --git a/Documentation/PCI/endpoint/pci-endpoint-doe.rst b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
new file mode 100644
index 000000000000..a424a0a3add6
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-endpoint-doe.rst
@@ -0,0 +1,329 @@
+.. SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+.. include:: <isonum.txt>
+
+=============================================
+Data Object Exchange (DOE) for PCIe Endpoint
+=============================================
+
+:Copyright: |copy| 2026 Texas Instruments Incorporated
+:Author: Aksh Garg <a-garg7@ti.com>
+:Co-Author: Siddharth Vadapalli <s-vadapalli@ti.com>
+
+Overview
+========
+
+DOE (Data Object Exchange) is a standard PCIe extended capability feature
+introduced in the Data Object Exchange (DOE) ECN for PCIe r5.0. It is an optional
+mechanism for system firmware/software running on root complex (host) to perform
+:ref:`data object <data-object-term>` exchanges with an endpoint function. Each
+data object is uniquely identified by the Vendor ID of the vendor publishing the
+data object definition and a Data Object Type value assigned by that vendor.
+
+Think of DOE as a sophisticated mailbox system built into PCIe. The root complex
+can send structured requests to the endpoint device through DOE mailboxes, and
+the endpoint device responds with appropriate data. DOE mailboxes are implemented
+as PCIe Extended Capabilities in endpoint devices, allowing multiple mailboxes
+per function, each potentially supporting different data object protocols.
+
+The DOE support for root complex devices has already been implemented in
+``drivers/pci/doe.c``.
+
+How DOE Works
+=============
+
+The DOE mailbox operates through a simple request-response model:
+
+1. **Host sends request**: The root complex writes a data object (vendor ID, type,
+ and payload) to the DOE write mailbox register (one DWORD at a time) of the
+ endpoint function's config space and sets the GO bit in the DOE Control register
+ to indicate that a request is ready for processing.
+2. **Endpoint processes**: The endpoint function reads the request from DOE write
+ mailbox register, sets the BUSY bit in the DOE Status register, identifies the
+ protocol of the data object, and executes the appropriate handler.
+3. **Endpoint responds**: The endpoint function writes the response data object to the
+ DOE read mailbox register (one DWORD at a time), and sets the READY bit in the DOE
+ Status register to indicate that the response is ready. If an error occurs during
+ request processing (such as unsupported protocol or handler failure), the endpoint
+ sets the ERROR bit in the DOE Status register instead of the READY bit.
+4. **Host reads response**: The root complex retrieves the response data from the DOE read
+ mailbox register once the READY bit is set in the DOE Status register, and then writes
+ any value to this register to indicate a successful read. If the ERROR bit was set,
+ the root complex discards the response and performs error handling as needed.
+
+Each mailbox operates independently and can handle one transaction at a time. The
+DOE specification supports data objects of size up to 1MB (2\ :sup:`18` dwords).
+
+For complete DOE capability details, refer to `PCI Express Base Specification Revision 7.0,
+Section 6.30 - Data Object Exchange (DOE)`.
+
+Key Terminologies
+=================
+
+.. _data-object-term:
+
+**Data Object**
+ A structured, vendor-defined, or standard-defined message exchanged between
+ root complex and endpoint function via DOE capability registers in configuration
+ space of the function.
+
+**Mailbox**
+ A DOE capability on the endpoint device, where each physical function can have
+ multiple mailboxes.
+
+**Protocol**
+ A specific type of DOE communication data object identified by a Vendor ID and Type.
+
+**Handler**
+ A function that processes DOE requests of a specific protocol and generates responses.
+
+Architecture of DOE Implementation for Endpoint
+===============================================
+
+.. code-block:: text
+
+ +------------------+
+ | |
+ | Root Complex |
+ | |
+ +--------^---------+
+ |
+ | Config space access
+ | over PCIe link
+ |
+ +----------v-----------+
+ | |
+ | PCIe Controller |
+ | as Endpoint |
+ | |
+ | +-----------------+ |
+ | | DOE Mailbox | |
+ | +-------^---------+ |
+ +----------|-----------+
+ +-----------|---------------------------------------------------------------+
+ | | +--------------------+ |
+ | +---------v--------+ Allocate | +--------------+ | |
+ | | |-------------------------------->| Request | | |
+ | | EP Controller | +--->| Buffer | | |
+ | | Driver | Free | | +--------------+ | |
+ | | |--------------------------+ | | | |
+ | +--------^---------+ | | | | |
+ | | | | | | |
+ | | | | | | |
+ | | pci_ep_doe_process_request() | | | | |
+ | | | | | | |
+ | +--------v---------+ Free | | | | |
+ | | |----------------------------+ | DDR | |
+ | | DOE EP Core |<----+ | | | |
+ | | (doe-ep.c) | | Discovery | | | |
+ | | |-----+ Protocol Handler | | | |
+ | +--------^---------+ | | | |
+ | | | | | |
+ | | protocol_handler() | | | |
+ | | | | | |
+ | +--------v---------+ | | | |
+ | | | | | +--------------+ | |
+ | | Protocol Handler | +----->| Response | | |
+ | | Module |-------------------------------->| Buffer | | |
+ | | (CMA/SPDM/Other) | Allocate | +--------------+ | |
+ | | | | | |
+ | +------------------+ | | |
+ | +--------------------+ |
+ +---------------------------------------------------------------------------+
+
+Initialization and Cleanup
+--------------------------
+
+**Framework Initialization and DOE Setup**
+
+The EPC core automatically initializes and sets up DOE mailboxes through the
+``pci_epc_init_capabilities()`` internal function, which is invoked during
+``pci_epc_init_notify()`` when the controller driver calls this API.
+Controller drivers do not need to explicitly handle DOE initialization,
+rather the EPC core manages this transparently.
+
+DOE initialization only occurs when the EPC driver reports DOE capability
+through the ``doe_capable`` flag in its ``pci_epc_features``.
+
+This internal function performs the following steps:
+
+1. Calls ``pci_ep_doe_init(epc)`` to initialize the xarray data structure
+ (a resizable array data structure defined in linux) named ``doe_mbs`` that
+ stores metadata of DOE mailboxes for the controller in ``struct pci_epc``.
+2. Calls ``pci_epc_doe_setup(epc)`` to discover all DOE capabilities in the
+ endpoint function's configuration space for each function. For each
+ discovered DOE capability, calls ``pci_ep_doe_add_mailbox(epc, func_no,
+ cap_offset)`` to register the mailbox.
+
+Each DOE mailbox structure created by ``pci_ep_doe_add_mailbox()`` gets an
+ordered workqueue allocated for processing DOE requests sequentially for that
+mailbox, enabling concurrent request handling across different mailboxes. Each
+mailbox is uniquely identified by the combination of physical function number
+and capability offset for that controller.
+
+**Cleanup**
+
+The EPC core automatically cleans up DOE mailboxes through the
+``pci_epc_deinit_capabilities()`` internal function, which is invoked during
+``pci_epc_deinit_notify()`` when the controller driver calls this API.
+Controller drivers do not need to explicitly handle DOE cleanup, rather
+the EPC core manages this transparently.
+
+DOE cleanup only occurs when the EPC device reported DOE capability
+through the ``doe_capable`` flag in its ``pci_epc_features``.
+
+This internal function calls ``pci_ep_doe_destroy(epc)``, which destroys all
+registered mailboxes, cancels any pending tasks, flushes and destroys the
+workqueues, and frees all memory allocated to the mailboxes.
+
+Protocol Handler Support
+------------------------
+
+Protocol implementations (such as CMA, SPDM, or vendor-specific protocols) are
+supported through a static array of protocol handlers.
+
+When a new DOE protocol library is introduced, its handler function is added to
+the static ``pci_doe_protocols`` array in ``drivers/pci/endpoint/pci-ep-doe.c``.
+The discovery protocol (VID = 0x0001 (PCI-SIG vendor ID), Type = 0x00 (discovery
+protocol)) is included in this static array and handled internally by the
+DOE EP core.
+
+Request Handling
+----------------
+
+The complete flow of a DOE request from the root complex to the response:
+
+**Step 1: Root Complex → EP Controller Driver**
+
+The root complex writes a DOE request (Vendor ID, Type, and Payload) to the
+DOE write mailbox register in the endpoint function's configuration space and sets
+the GO bit in the DOE Control register, indicating that the request is ready for
+processing.
+
+**Step 2: EP Controller Driver → DOE EP Core**
+
+The controller driver reads the request header to determine the data object
+length. Based on this length field, it allocates a request buffer in memory
+(DDR) of the appropriate size. The driver then reads the complete request
+payload from the DOE write mailbox register and converts the data from
+little-endian format (the format followed in the PCIe transactions over the
+link) to CPU-native format using ``le32_to_cpu()``. The driver defines a
+completion callback function with signature ``void (*complete)(struct pci_epc *epc,
+u8 func_no, u16 cap_offset, int status, u16 vendor, u8 type, void *response_pl,
+size_t response_pl_sz)`` to be invoked when the request processing completes.
+The driver then calls ``pci_ep_doe_process_request(epc, func_no, cap_offset,
+vendor, type, request, request_sz, complete)`` to hand off the request to the
+DOE EP core. This function returns immediately after queuing the work
+(without blocking), and the driver sets the BUSY bit in the DOE Status register.
+
+**Step 3: DOE EP Core Processing**
+
+The DOE EP core creates a task structure and submits it to the mailbox's ordered
+workqueue. This ensures that requests for each mailbox are processed
+sequentially, one at a time, as required by the DOE specification. It looks up
+the protocol handler based on the Vendor ID and Type from the request header,
+and executes the handler function.
+
+**Step 4: Protocol Handler Execution**
+
+The workqueue executes the task by calling the registered protocol handler:
+``handler(request, request_sz, &response, &response_sz)``. The handler processes
+the request, allocates a response buffer in memory (DDR), builds the response
+data, and returns the response pointer and size. For the discovery protocol,
+the DOE EP core handles this directly without invoking an external handler.
+
+**Step 5: DOE EP Core → EP Controller Driver**
+
+After the protocol handler completes, the DOE EP core frees the request buffer,
+and invokes the completion callback provided by the controller driver asynchronously.
+The callback receives the struct pci_epc, function number, capability offset (to
+identify the mailbox), status code indicating the result of request processing,
+vendor ID and type of the data object, the response buffer, and its size.
+
+**Step 6: EP Controller Driver → Root Complex**
+
+The controller driver converts the response from CPU-native format to
+little-endian format using ``cpu_to_le32()``, writes the response to DOE read
+mailbox register, and sets the READY bit in the DOE Status register. The root
+complex then reads the response from the read mailbox register. Finally, the controller
+driver frees the response buffer (which the handler allocated).
+
+Asynchronous Request Processing
+-------------------------------
+
+The DOE-EP framework implements asynchronous request processing because an
+endpoint function can have multiple instances of DOE mailboxes, and requests may
+be interleaved across these mailboxes. Request processing of one mailbox should
+not result in blocking request processing of other mailboxes. Hence, requests
+on each mailbox need to be handled in parallel for optimization.
+
+For the EP controller driver to handle requests on multiple mailboxes in
+parallel, ``pci_ep_doe_process_request()`` must be asynchronous. The function
+returns immediately after submitting the request to the mailbox's workqueue,
+without waiting for the request to complete. A completion callback provided by
+the controller driver is invoked asynchronously when request processing
+finishes. This asynchronous design enables concurrent processing of requests
+across different mailboxes.
+
+Abort Handling
+--------------
+
+The DOE specification allows the root complex to abort ongoing DOE operations
+by setting the ABORT bit in the DOE Control register.
+
+**Trigger**
+
+When the root complex sets the ABORT bit, the EP controller driver detects this
+condition (typically in an interrupt handler or register polling routine). The
+action taken depends on the timing of the abort:
+
+- **ABORT during request transfer**: If the ABORT bit is set while the root complex
+ is still transferring the request to the mailbox registers, the controller driver
+ discards the request and no call to ``pci_ep_doe_abort()`` is needed.
+
+- **ABORT after request submission**: If the ABORT bit is set after the request
+ has been fully received and submitted to the DOE EP core via
+ ``pci_ep_doe_process_request()``, the controller driver must call
+ ``pci_ep_doe_abort(epc, func_no, cap_offset)`` for the affected mailbox to
+ perform abort sequence in the DOE EP core.
+
+**Abort Sequence**
+
+The abort function performs the following actions:
+
+1. Sets the CANCEL flag on the mailbox to prevent queued requests from starting
+2. Flushes the workqueue to wait for any currently executing handler to complete
+ (handlers cannot be interrupted mid-execution)
+3. Clears the CANCEL flag to allow the mailbox to accept new requests
+
+Queued requests that have not started execution will be aborted with an error
+status. The currently executing request will complete normally, and the controller
+will reject the response if it arrives after the abort sequence has been triggered.
+
+.. note::
+ Independent of when the ABORT bit is triggered, the controller driver must
+ clear the ERROR, BUSY, and READY bits in the DOE Status register after
+ completing the abort operation to reset the mailbox to an idle state.
+
+Error Handling
+--------------
+
+Errors can occur during DOE request processing for various reasons, such as
+unsupported protocols, handler failures, or memory allocation failures.
+
+**Error Detection**
+
+When an error occurs during DOE request processing, the DOE EP core propagates this error
+back to the controller driver either through the ``pci_ep_doe_process_request()`` return value,
+or the status code passed to the completion callback.
+
+**Error Response**
+
+When the controller driver receives an error code, it sets the ERROR bit in the DOE Status
+register instead of writing a response to the read mailbox register, and frees the buffers.
+
+API Reference
+=============
+
+.. kernel-doc:: drivers/pci/endpoint/pci-ep-doe.c
+ :export:
--
2.34.1
^ permalink raw reply related
* [PATCH v4 3/4] PCI: endpoint: Add support for DOE initialization and setup in EPC core
From: Aksh Garg @ 2026-05-22 5:24 UTC (permalink / raw)
To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
skhan, lukas, cassel, alistair
Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk,
a-garg7
In-Reply-To: <20260522052434.802034-1-a-garg7@ti.com>
Add pci_epc_init_capabilities() in EPC core driver to initialize and
setup the capabilities supported by the EPC driver. This calls
pci_epc_doe_setup() to setup the DOE framework for an endpoint controller,
which discovers the DOE capabilities (extended capability ID 0x2E), and
registers each discovered DOE mailbox for all the functions in the
endpoint controller.
Add pci_epc_deinit_capabilities() in EPC core driver for cleanup of the
resources used by the capabilities of the EPC driver. This calls
pci_ep_doe_destroy() to destroy all DOE mailboxes and free associated
resources.
Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---
Changes from v3 to v4:
- Call DOE setup and destroy APIs directly within the EPC core, instead of
relying on the EPC drivers to call them individually. EPC drivers do not
need to explicitly handle DOE setup, rather the EPC core manages this
transparently. (Suggested by Manivannan Sadhasivam).
- Removed pci_epc_doe_destroy() API, which was just calling pci_ep_doe_destroy().
Instead, called pci_ep_doe_destroy() directly during cleanup.
- Called pci_ep_doe_init() before the "!epc->ops->find_ext_capability" check,
because if doe-capable=1 and find_ext_capability() op is undefined, this
would not initialize the epc->doe_mbs xarray. However during cleanup, the
check "!epc->ops->find_ext_capability" would be unnecessary, and it will
try to destroy the epc->doe_mbs xarray even when it was not initialized.
Changes from v2 to v3:
- Rebased on 7.1-rc1.
Changes since v1:
- New patch added to v2 (not present in v1)
v3: https://lore.kernel.org/all/20260427051725.223704-4-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-4-a-garg7@ti.com/
This patch is introduced based on the feedback provided by Manivannan
Sadhasivam at [1].
[1]: https://lore.kernel.org/all/p57x6jleaim5w7t2k3v7tioujnaxuovfpj5euop5ogefvw23se@y5fw3che5p5d/
drivers/pci/endpoint/pci-epc-core.c | 92 +++++++++++++++++++++++++++++
include/linux/pci-epc.h | 6 ++
2 files changed, 98 insertions(+)
diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
index 6c3c58185fc5..af7889b541c0 100644
--- a/drivers/pci/endpoint/pci-epc-core.c
+++ b/drivers/pci/endpoint/pci-epc-core.c
@@ -14,6 +14,8 @@
#include <linux/pci-epf.h>
#include <linux/pci-ep-cfs.h>
+#include "../pci.h"
+
static const struct class pci_epc_class = {
.name = "pci_epc",
};
@@ -842,6 +844,71 @@ void pci_epc_linkdown(struct pci_epc *epc)
}
EXPORT_SYMBOL_GPL(pci_epc_linkdown);
+/**
+ * pci_epc_doe_setup() - Discover and setup DOE mailboxes for all functions
+ * @epc: the EPC device on which DOE mailboxes has to be setup
+ *
+ * Discover DOE (Data Object Exchange) capabilities for all physical functions
+ * in the endpoint controller and register DOE mailboxes.
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+static int pci_epc_doe_setup(struct pci_epc *epc)
+{
+ u16 cap_offset = 0;
+ u8 func_no;
+ int ret;
+
+ if (!epc->ops || !epc->ops->find_ext_capability)
+ return -EINVAL;
+
+ /* Discover DOE capabilities for all functions */
+ for (func_no = 0; func_no < epc->max_functions; func_no++) {
+ while ((cap_offset = epc->ops->find_ext_capability(epc, func_no, 0,
+ cap_offset,
+ PCI_EXT_CAP_ID_DOE))) {
+ /* Register this DOE mailbox */
+ ret = pci_ep_doe_add_mailbox(epc, func_no, cap_offset);
+ if (ret) {
+ dev_warn(&epc->dev,
+ "[pf%d:offset %x] failed to add DOE mailbox\n",
+ func_no, cap_offset);
+ }
+ }
+ }
+
+ dev_dbg(&epc->dev, "DOE mailboxes setup complete\n");
+ return 0;
+}
+
+/**
+ * pci_epc_init_capabilities() - Initialize EPC capabilities
+ * @epc: the EPC device whose capabilities need to be initialized
+ *
+ * Invoke to initialize capabilities supported by the EPC device.
+ */
+static void pci_epc_init_capabilities(struct pci_epc *epc)
+{
+ const struct pci_epc_features *epc_features;
+ int ret;
+
+ epc_features = pci_epc_get_features(epc, 0, 0);
+ if (!epc_features)
+ return;
+
+ if (IS_ENABLED(CONFIG_PCI_ENDPOINT_DOE) && epc_features->doe_capable) {
+ ret = pci_ep_doe_init(epc);
+ if (ret) {
+ dev_warn(&epc->dev, "DOE initialization failed: %d\n", ret);
+ return;
+ }
+
+ ret = pci_epc_doe_setup(epc);
+ if (ret)
+ dev_warn(&epc->dev, "DOE setup failed: %d\n", ret);
+ }
+}
+
/**
* pci_epc_init_notify() - Notify the EPF device that EPC device initialization
* is completed.
@@ -857,6 +924,8 @@ void pci_epc_init_notify(struct pci_epc *epc)
if (IS_ERR_OR_NULL(epc))
return;
+ pci_epc_init_capabilities(epc);
+
mutex_lock(&epc->list_lock);
list_for_each_entry(epf, &epc->pci_epf, list) {
mutex_lock(&epf->lock);
@@ -890,6 +959,27 @@ void pci_epc_notify_pending_init(struct pci_epc *epc, struct pci_epf *epf)
}
EXPORT_SYMBOL_GPL(pci_epc_notify_pending_init);
+/**
+ * pci_epc_deinit_capabilities() - Cleanup EPC capabilities
+ * @epc: the EPC device whose capabilities need to be cleaned up
+ *
+ * Invoke to cleanup capabilities supported by the EPC device,
+ * and free the associated resources.
+ */
+static void pci_epc_deinit_capabilities(struct pci_epc *epc)
+{
+ const struct pci_epc_features *epc_features;
+
+ epc_features = pci_epc_get_features(epc, 0, 0);
+ if (!epc_features)
+ return;
+
+ if (IS_ENABLED(CONFIG_PCI_ENDPOINT_DOE) && epc_features->doe_capable) {
+ pci_ep_doe_destroy(epc);
+ dev_dbg(&epc->dev, "DOE mailboxes destroyed\n");
+ }
+}
+
/**
* pci_epc_deinit_notify() - Notify the EPF device about EPC deinitialization
* @epc: the EPC device whose deinitialization is completed
@@ -903,6 +993,8 @@ void pci_epc_deinit_notify(struct pci_epc *epc)
if (IS_ERR_OR_NULL(epc))
return;
+ pci_epc_deinit_capabilities(epc);
+
mutex_lock(&epc->list_lock);
list_for_each_entry(epf, &epc->pci_epf, list) {
mutex_lock(&epf->lock);
diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
index dd26294c8175..11474e337db3 100644
--- a/include/linux/pci-epc.h
+++ b/include/linux/pci-epc.h
@@ -84,6 +84,8 @@ struct pci_epc_map {
* @start: ops to start the PCI link
* @stop: ops to stop the PCI link
* @get_features: ops to get the features supported by the EPC
+ * @find_ext_capability: ops to find extended capability offset for a function
+ * in endpoint controller
* @owner: the module owner containing the ops
*/
struct pci_epc_ops {
@@ -115,6 +117,8 @@ struct pci_epc_ops {
void (*stop)(struct pci_epc *epc);
const struct pci_epc_features* (*get_features)(struct pci_epc *epc,
u8 func_no, u8 vfunc_no);
+ u16 (*find_ext_capability)(struct pci_epc *epc, u8 func_no,
+ u8 vfunc_no, u16 start, u8 cap);
struct module *owner;
};
@@ -270,6 +274,7 @@ struct pci_epc_bar_desc {
* @msi_capable: indicate if the endpoint function has MSI capability
* @msix_capable: indicate if the endpoint function has MSI-X capability
* @intx_capable: indicate if the endpoint can raise INTx interrupts
+ * @doe_capable: indicate if the endpoint function has DOE capability
* @bar: array specifying the hardware description for each BAR
* @align: alignment size required for BAR buffer allocation
*/
@@ -280,6 +285,7 @@ struct pci_epc_features {
unsigned int msi_capable : 1;
unsigned int msix_capable : 1;
unsigned int intx_capable : 1;
+ unsigned int doe_capable : 1;
struct pci_epc_bar_desc bar[PCI_STD_NUM_BARS];
size_t align;
};
--
2.34.1
^ permalink raw reply related
* [PATCH v4 2/4] PCI: endpoint: Add DOE mailbox support for endpoint functions
From: Aksh Garg @ 2026-05-22 5:24 UTC (permalink / raw)
To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
skhan, lukas, cassel, alistair
Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk,
a-garg7
In-Reply-To: <20260522052434.802034-1-a-garg7@ti.com>
DOE (Data Object Exchange) is a standard PCIe extended capability
feature introduced in the Data Object Exchange (DOE) ECN for
PCIe r5.0. It provides a communication mechanism primarily used for
implementing PCIe security features such as device authentication, and
secure link establishment. Think of DOE as a sophisticated mailbox
system built into PCIe. The root complex can send structured requests
to the endpoint device through DOE mailboxes, and the endpoint device
responds with appropriate data.
Add the DOE support for PCIe endpoint devices, enabling endpoint
functions to process the DOE requests from the host. The implementation
provides framework APIs for EPC core driver and controller drivers to
register mailboxes, and request processing with workqueues ensuring
sequential handling per mailbox, and parallel handling across mailboxes.
The Discovery protocol is handled internally by the DOE core.
This implementation complements the existing DOE implementation for
root complex in drivers/pci/doe.c.
Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---
Changes from v3 to v4:
- Used 'Returns' instead of 'RETURNS' in the function docstrings to
comply with kernel-doc format, as suggested by Manivannan Sadhasivam.
- In pci_ep_doe_process_request(), changed the type of request buffer
from "const void *" to "void *", as the ownership is transferred to
DOE-EP framework, which is responsible to free the buffer.
- Added "struct pci_epc *epc" to typedef "pci_ep_doe_complete_t", to be
used by the EPC driver.
Changes from v2 to v3:
- Rebased on 7.1-rc1.
Changes since v1:
- Moved the DOE-EP core file to drivers/pci/endpoint/pci-ep-doe.c, and
corresponding Kconfig and Makefile to match the existing naming scheme,
as suggested by Niklas Cassel.
- Renamed the config from PCI_DOE_EP to PCI_ENDPOINT_DOE
- Moved the function declarations that need not be visible outside the
PCI core to drivers/pci/pci.h instead to include/linux/pci-doe.h as
suggested by Lukas Wunner
- Converted from synchronous to asynchronous request processing:
* Removed wait_for_completion() from pci_ep_doe_process_request()
* Function returns immediately after queuing to workqueue, hence
removed private data for completion in the task structure
* Added completion callback as an additional argument to
pci_ep_doe_process_request(), which takes the response and status
parameters as arguments (along with other required arguments), hence
removed task_status in the task structure
* Created a typedef pci_ep_doe_complete_t for completion callback
* Removed the pci_ep_doe_task_complete() function, as it would not be
required anymore with these changes
* Moved from INIT_WORK_ONSTACK() to INIT_WORK(), to initialize the work
on heap instead of stack
* signal_task_complete() now invokes the completion callback, once the
protocol handler completes its task
- Changed from dynamic xarray-based protocol registration to static array:
* Removed the register/unregister protocol APIs
* Replaced the dynamic xarray with static array of struct pci_doe_protocol
* Added discovery protocol to static array, instead of treating it specially,
hence removed the special handling for Discovery protocol in
doe_ep_task_work()
* Updated pci_ep_doe_handle_discovery() and pci_ep_doe_find_protocol()
accordingly.
- Memory Management:
* DOE core frees request buffer in signal_task_complete()
or during error handling
* pci_ep_doe_process_request() defines response_pl and response_pl_sz
as NULL and 0 respectively, whose pointer is passed to the protocol
handler, hence removed the arguments void **response, size_t *response_sz
to this function.
- Task structure refactoring:
* Response buffer: void **response_pl to void *response_pl
* Response size: size_t *response_pl_sz to size_t response_pl_sz
* Changed the completion callback to type pci_ep_doe_complete_t
* Removed void *private and int task_status
- Updated documentation comments of the functions according to the changes
v3: https://lore.kernel.org/all/20260427051725.223704-3-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-3-a-garg7@ti.com/
v1: https://lore.kernel.org/all/20260213123603.420941-4-a-garg7@ti.com/
drivers/pci/endpoint/Kconfig | 14 +
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/pci-ep-doe.c | 553 ++++++++++++++++++++++++++++++
drivers/pci/pci.h | 39 +++
include/linux/pci-doe.h | 5 +
include/linux/pci-epc.h | 3 +
6 files changed, 615 insertions(+)
create mode 100644 drivers/pci/endpoint/pci-ep-doe.c
diff --git a/drivers/pci/endpoint/Kconfig b/drivers/pci/endpoint/Kconfig
index 8dad291be8b8..15ae16aaa58f 100644
--- a/drivers/pci/endpoint/Kconfig
+++ b/drivers/pci/endpoint/Kconfig
@@ -36,6 +36,20 @@ config PCI_ENDPOINT_MSI_DOORBELL
doorbell. The RC can trigger doorbell in EP by writing data to a
dedicated BAR, which the EP maps to the controller's message address.
+config PCI_ENDPOINT_DOE
+ bool "PCI Endpoint Data Object Exchange (DOE) support"
+ depends on PCI_ENDPOINT
+ help
+ This enables support for Data Object Exchange (DOE) protocol
+ on PCI Endpoint controllers. It provides a communication
+ mechanism through mailboxes, primarily used for PCIe security
+ features.
+
+ Say Y here if you want be able to communicate using PCIe DOE
+ mailboxes.
+
+ If unsure, say N.
+
source "drivers/pci/endpoint/functions/Kconfig"
endmenu
diff --git a/drivers/pci/endpoint/Makefile b/drivers/pci/endpoint/Makefile
index b4869d52053a..1fa176b6792b 100644
--- a/drivers/pci/endpoint/Makefile
+++ b/drivers/pci/endpoint/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_ENDPOINT_CONFIGFS) += pci-ep-cfs.o
obj-$(CONFIG_PCI_ENDPOINT) += pci-epc-core.o pci-epf-core.o\
pci-epc-mem.o functions/
obj-$(CONFIG_PCI_ENDPOINT_MSI_DOORBELL) += pci-ep-msi.o
+obj-$(CONFIG_PCI_ENDPOINT_DOE) += pci-ep-doe.o
diff --git a/drivers/pci/endpoint/pci-ep-doe.c b/drivers/pci/endpoint/pci-ep-doe.c
new file mode 100644
index 000000000000..63a5c1b8753a
--- /dev/null
+++ b/drivers/pci/endpoint/pci-ep-doe.c
@@ -0,0 +1,553 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+/*
+ * Data Object Exchange for PCIe Endpoint
+ * PCIe r7.0, sec 6.30 DOE
+ *
+ * Copyright (C) 2026 Texas Instruments Incorporated - https://www.ti.com
+ * Aksh Garg <a-garg7@ti.com>
+ * Siddharth Vadapalli <s-vadapalli@ti.com>
+ */
+
+#define dev_fmt(fmt) "DOE EP: " fmt
+
+#include <linux/bitfield.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/pci-epc.h>
+#include <linux/pci-doe.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+
+#include "../pci.h"
+
+/* Forward declaration of discovery protocol handler */
+static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
+ void **response, size_t *response_sz);
+
+/**
+ * struct pci_doe_protocol - DOE protocol handler entry
+ * @vid: Vendor ID
+ * @type: Protocol type
+ * @handler: Handler function pointer
+ */
+struct pci_doe_protocol {
+ u16 vid;
+ u8 type;
+ pci_doe_protocol_handler_t handler;
+};
+
+/**
+ * struct pci_ep_doe_mb - State for a single DOE mailbox on EP
+ *
+ * This state is used to manage a single DOE mailbox capability on the
+ * endpoint side.
+ *
+ * @epc: PCI endpoint controller this mailbox belongs to
+ * @func_no: Physical function number of the function this mailbox belongs to
+ * @cap_offset: Capability offset
+ * @work_queue: Queue of work items
+ * @flags: Bit array of PCI_DOE_FLAG_* flags
+ */
+struct pci_ep_doe_mb {
+ struct pci_epc *epc;
+ u8 func_no;
+ u16 cap_offset;
+ struct workqueue_struct *work_queue;
+ unsigned long flags;
+};
+
+/**
+ * struct pci_ep_doe_task - Represents a single DOE request/response task
+ *
+ * @feat: DOE feature (vendor ID and type)
+ * @request_pl: Request payload
+ * @request_pl_sz: Size of request payload in bytes
+ * @response_pl: Response buffer
+ * @response_pl_sz: Size of response buffer in bytes
+ * @complete: Completion callback
+ * @work: Work structure for workqueue
+ * @doe_mb: DOE mailbox handling this task
+ */
+struct pci_ep_doe_task {
+ struct pci_doe_feature feat;
+ const void *request_pl;
+ size_t request_pl_sz;
+ void *response_pl;
+ size_t response_pl_sz;
+ pci_ep_doe_complete_t complete;
+
+ /* Initialized by pci_ep_doe_submit_task() */
+ struct work_struct work;
+ struct pci_ep_doe_mb *doe_mb;
+};
+
+/*
+ * Global registry of protocol handlers.
+ * When a new DOE protocol, library is added, add an entry to this array.
+ */
+static const struct pci_doe_protocol pci_doe_protocols[] = {
+ {
+ .vid = PCI_VENDOR_ID_PCI_SIG,
+ .type = PCI_DOE_FEATURE_DISCOVERY,
+ .handler = pci_ep_doe_handle_discovery,
+ },
+};
+
+/*
+ * Combines function number and capability offset into a unique lookup key
+ * for storing/retrieving DOE mailboxes in an xarray.
+ */
+#define PCI_DOE_MB_KEY(func, offset) \
+ (((unsigned long)(func) << 16) | (offset))
+#define PCI_DOE_PROTOCOL_COUNT ARRAY_SIZE(pci_doe_protocols)
+
+/**
+ * pci_ep_doe_init() - Initialize the DOE framework for a controller in EP mode
+ * @epc: PCI endpoint controller
+ *
+ * Initialize the DOE framework data structures. This only initializes
+ * the xarray that will hold the mailboxes.
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+int pci_ep_doe_init(struct pci_epc *epc)
+{
+ if (!epc)
+ return -EINVAL;
+
+ xa_init(&epc->doe_mbs);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_init);
+
+/**
+ * pci_ep_doe_add_mailbox() - Add a DOE mailbox for a physical function
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: Offset of the DOE capability
+ *
+ * Create and register a DOE mailbox for the specified physical function
+ * and capability offset.
+ *
+ * EPC core driver calls this for each DOE capability discovered in the config
+ * space of each endpoint function if DOE support is available for the EPC.
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no, u16 cap_offset)
+{
+ struct pci_ep_doe_mb *doe_mb;
+ unsigned long key;
+ int ret;
+
+ if (!epc)
+ return -EINVAL;
+
+ doe_mb = kzalloc_obj(*doe_mb, GFP_KERNEL);
+ if (!doe_mb)
+ return -ENOMEM;
+
+ doe_mb->epc = epc;
+ doe_mb->func_no = func_no;
+ doe_mb->cap_offset = cap_offset;
+
+ doe_mb->work_queue = alloc_ordered_workqueue("pci_ep_doe[%s:pf%d:offset%x]", 0,
+ dev_name(&epc->dev),
+ func_no, cap_offset);
+ if (!doe_mb->work_queue) {
+ dev_err(epc->dev.parent,
+ "[pf%d:offset%x] failed to allocate work queue\n",
+ func_no, cap_offset);
+ ret = -ENOMEM;
+ goto err_free;
+ }
+
+ /* Add to xarray with composite key */
+ key = PCI_DOE_MB_KEY(func_no, cap_offset);
+ ret = xa_insert(&epc->doe_mbs, key, doe_mb, GFP_KERNEL);
+ if (ret) {
+ dev_err(epc->dev.parent,
+ "[pf%d:offset%x] failed to insert mailbox: %d\n",
+ func_no, cap_offset, ret);
+ goto err_destroy;
+ }
+
+ dev_dbg(epc->dev.parent,
+ "DOE mailbox added: pf%d offset 0x%x\n",
+ func_no, cap_offset);
+
+ return 0;
+
+err_destroy:
+ destroy_workqueue(doe_mb->work_queue);
+err_free:
+ kfree(doe_mb);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_add_mailbox);
+
+/**
+ * pci_ep_doe_cancel_tasks() - Cancel all pending tasks
+ * @doe_mb: DOE mailbox
+ *
+ * Cancel all pending tasks in the mailbox. Mark the mailbox as dead
+ * so no new tasks can be submitted.
+ */
+static void pci_ep_doe_cancel_tasks(struct pci_ep_doe_mb *doe_mb)
+{
+ if (!doe_mb)
+ return;
+
+ /* Mark the mailbox as dead */
+ set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+
+ /* Stop all pending work items from starting */
+ set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+}
+
+/**
+ * pci_ep_doe_get_mailbox() - Get DOE mailbox by function and offset
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: Offset of the DOE capability
+ *
+ * Internal helper to look up a DOE mailbox by its function number and
+ * capability offset.
+ *
+ * Returns: Pointer to the mailbox or NULL if not found
+ */
+static struct pci_ep_doe_mb *pci_ep_doe_get_mailbox(struct pci_epc *epc,
+ u8 func_no, u16 cap_offset)
+{
+ unsigned long key;
+
+ if (!epc)
+ return NULL;
+
+ key = PCI_DOE_MB_KEY(func_no, cap_offset);
+ return xa_load(&epc->doe_mbs, key);
+}
+
+/**
+ * pci_ep_doe_find_protocol() - Find protocol handler in static array
+ * @vendor: Vendor ID
+ * @type: Protocol type
+ *
+ * Look up a protocol handler in the static protocol array by matching vendor ID
+ * and protocol type.
+ *
+ * Returns: Handler function pointer or NULL if not found
+ */
+static pci_doe_protocol_handler_t pci_ep_doe_find_protocol(u16 vendor, u8 type)
+{
+ int i;
+
+ /* Search static protocol array */
+ for (i = 0; i < PCI_DOE_PROTOCOL_COUNT; i++) {
+ if (pci_doe_protocols[i].vid == vendor &&
+ pci_doe_protocols[i].type == type)
+ return pci_doe_protocols[i].handler;
+ }
+
+ return NULL;
+}
+
+/**
+ * pci_ep_doe_handle_discovery() - Handle Discovery protocol request
+ * @request: Request payload
+ * @request_sz: Request size
+ * @response: Output pointer for response buffer
+ * @response_sz: Output pointer for response size
+ *
+ * Handle the DOE Discovery protocol. The request contains an index specifying
+ * which protocol to query. This function creates a response containing the
+ * vendor ID and protocol type for the requested index, along with the next
+ * index value for further discovery:
+ *
+ * - next_index = 0: Signals this is the last protocol supported
+ * - next_index = n (non-zero): Signals more protocols available,
+ * query index n next
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+static int pci_ep_doe_handle_discovery(const void *request, size_t request_sz,
+ void **response, size_t *response_sz)
+{
+ struct pci_doe_protocol protocol;
+ u8 requested_index, next_index;
+ u32 *response_pl;
+ u32 request_pl;
+ u16 vendor;
+ u8 type;
+
+ if (request_sz != sizeof(u32))
+ return -EINVAL;
+
+ request_pl = *(u32 *)request;
+ requested_index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, request_pl);
+
+ if (requested_index >= PCI_DOE_PROTOCOL_COUNT)
+ return -EINVAL;
+
+ /* Get protocol from array at requested_index */
+ protocol = pci_doe_protocols[requested_index];
+ vendor = protocol.vid;
+ type = protocol.type;
+
+ /* Calculate next index */
+ next_index = (requested_index + 1 < PCI_DOE_PROTOCOL_COUNT) ? requested_index + 1 : 0;
+
+ response_pl = kzalloc_obj(*response_pl, GFP_KERNEL);
+ if (!response_pl)
+ return -ENOMEM;
+
+ /* Build response */
+ *response_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, vendor) |
+ FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE, type) |
+ FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, next_index);
+
+ *response = response_pl;
+ *response_sz = sizeof(*response_pl);
+
+ return 0;
+}
+
+static void signal_task_complete(struct pci_ep_doe_task *task, int status)
+{
+ kfree(task->request_pl);
+ task->complete(task->doe_mb->epc, task->doe_mb->func_no,
+ task->doe_mb->cap_offset, status,
+ task->feat.vid, task->feat.type,
+ task->response_pl, task->response_pl_sz);
+ kfree(task);
+}
+
+/**
+ * doe_ep_task_work() - Work function for processing DOE EP tasks
+ * @work: Work structure
+ *
+ * Process a DOE request by calling the appropriate protocol handler.
+ */
+static void doe_ep_task_work(struct work_struct *work)
+{
+ struct pci_ep_doe_task *task = container_of(work, struct pci_ep_doe_task,
+ work);
+ struct pci_ep_doe_mb *doe_mb = task->doe_mb;
+ pci_doe_protocol_handler_t handler;
+ int rc;
+
+ if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+ signal_task_complete(task, -EIO);
+ return;
+ }
+
+ /* Check if request was aborted */
+ if (test_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags)) {
+ signal_task_complete(task, -ECANCELED);
+ return;
+ }
+
+ /* Find protocol handler in the array */
+ handler = pci_ep_doe_find_protocol(task->feat.vid, task->feat.type);
+ if (!handler) {
+ dev_warn(doe_mb->epc->dev.parent,
+ "[%d:%x] Unsupported protocol VID=%04x TYPE=%02x\n",
+ doe_mb->func_no, doe_mb->cap_offset,
+ task->feat.vid, task->feat.type);
+ signal_task_complete(task, -EOPNOTSUPP);
+ return;
+ }
+
+ /* Call protocol handler */
+ rc = handler(task->request_pl, task->request_pl_sz,
+ &task->response_pl, &task->response_pl_sz);
+
+ signal_task_complete(task, rc);
+}
+
+/**
+ * pci_ep_doe_submit_task() - Submit a task to be processed
+ * @doe_mb: DOE mailbox
+ * @task: Task to submit
+ *
+ * Submit a DOE task to the workqueue for asynchronous processing.
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+static int pci_ep_doe_submit_task(struct pci_ep_doe_mb *doe_mb,
+ struct pci_ep_doe_task *task)
+{
+ if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
+ return -EIO;
+
+ task->doe_mb = doe_mb;
+ INIT_WORK(&task->work, doe_ep_task_work);
+ queue_work(doe_mb->work_queue, &task->work);
+ return 0;
+}
+
+/**
+ * pci_ep_doe_process_request() - Process DOE request on endpoint
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: DOE capability offset
+ * @vendor: Vendor ID from request header
+ * @type: Protocol type from request header
+ * @request: Request payload in CPU-native format
+ * @request_sz: Size of request payload (bytes)
+ * @complete: Callback to invoke upon completion
+ *
+ * Asynchronously process a DOE request received on the endpoint. The request
+ * payload should not include the DOE header (vendor/type/length). Ownership
+ * of the request buffer is transferred to DOE EP core, which frees the buffer
+ * either on error or after the completion callback fires. The protocol handler
+ * will allocate the response buffer, which the caller (controller driver) must
+ * free after use.
+ *
+ * This function returns immediately after queuing the request. The completion
+ * callback will be invoked asynchronously from workqueue context once the
+ * request is processed. The callback receives the function number and capability
+ * offset to identify the mailbox, along with a status code (0 on success, -errno
+ * on failure), and other required arguments.
+ *
+ * As per DOE specification, a mailbox processes one request at a time.
+ * Therefore, this function will never be called concurrently for the same
+ * mailbox by different callers.
+ *
+ * The caller is responsible for the conversion of the received DOE request
+ * with le32_to_cpu() before calling this function.
+ * Similarly, it is responsible for converting the response payload with
+ * cpu_to_le32() before sending it back over the DOE mailbox.
+ *
+ * The caller is also responsible for ensuring that the request size
+ * is within the limits defined by PCI_DOE_MAX_LENGTH.
+ *
+ * Returns: 0 if the request was successfully queued, -errno on failure
+ */
+int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
+ u16 vendor, u8 type, void *request, size_t request_sz,
+ pci_ep_doe_complete_t complete)
+{
+ struct pci_ep_doe_mb *doe_mb;
+ struct pci_ep_doe_task *task;
+ int rc;
+
+ doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
+ if (!doe_mb) {
+ kfree(request);
+ return -ENODEV;
+ }
+
+ task = kzalloc_obj(*task, GFP_KERNEL);
+ if (!task) {
+ kfree(request);
+ return -ENOMEM;
+ }
+
+ task->feat.vid = vendor;
+ task->feat.type = type;
+ task->request_pl = request;
+ task->request_pl_sz = request_sz;
+ task->response_pl = NULL;
+ task->response_pl_sz = 0;
+ task->complete = complete;
+
+ rc = pci_ep_doe_submit_task(doe_mb, task);
+ if (rc) {
+ kfree(request);
+ kfree(task);
+ return rc;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_process_request);
+
+/**
+ * pci_ep_doe_abort() - Abort DOE operations on a mailbox
+ * @epc: PCI endpoint controller
+ * @func_no: Physical function number
+ * @cap_offset: DOE capability offset
+ *
+ * Abort all queued and wait for in-flight DOE operations to complete for the
+ * specified mailbox. This function is called by the EP controller driver
+ * when the RC sets the ABORT bit in the DOE Control register.
+ *
+ * The function will:
+ *
+ * - Set CANCEL flag to prevent new requests in the queue from starting
+ * - Wait for the currently executing handler to complete (cannot interrupt)
+ * - Flush the workqueue to wait for all requests to be handled appropriately
+ * - Clear CANCEL flag to prepare for new requests
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no, u16 cap_offset)
+{
+ struct pci_ep_doe_mb *doe_mb;
+
+ if (!epc)
+ return -EINVAL;
+
+ doe_mb = pci_ep_doe_get_mailbox(epc, func_no, cap_offset);
+ if (!doe_mb)
+ return -ENODEV;
+
+ /* Set CANCEL flag - worker will abort queued requests */
+ set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+ flush_workqueue(doe_mb->work_queue);
+
+ /* Clear CANCEL flag - mailbox ready for new requests */
+ clear_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
+
+ dev_dbg(epc->dev.parent,
+ "DOE mailbox aborted: PF%d offset 0x%x\n",
+ func_no, cap_offset);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_abort);
+
+/**
+ * pci_ep_doe_destroy_mb() - Destroy a single DOE mailbox
+ * @doe_mb: DOE mailbox to destroy
+ *
+ * Internal function to destroy a mailbox and free its resources.
+ */
+static void pci_ep_doe_destroy_mb(struct pci_ep_doe_mb *doe_mb)
+{
+ if (!doe_mb)
+ return;
+
+ pci_ep_doe_cancel_tasks(doe_mb);
+
+ if (doe_mb->work_queue)
+ destroy_workqueue(doe_mb->work_queue);
+
+ kfree(doe_mb);
+}
+
+/**
+ * pci_ep_doe_destroy() - Destroy all DOE mailboxes
+ * @epc: PCI endpoint controller
+ *
+ * Destroy all DOE mailboxes and free associated resources.
+ *
+ * The EPC core driver calls this to free all DOE resources,
+ * if DOE support is available for the EPC.
+ */
+void pci_ep_doe_destroy(struct pci_epc *epc)
+{
+ struct pci_ep_doe_mb *doe_mb;
+ unsigned long index;
+
+ if (!epc)
+ return;
+
+ xa_for_each(&epc->doe_mbs, index, doe_mb)
+ pci_ep_doe_destroy_mb(doe_mb);
+
+ xa_destroy(&epc->doe_mbs);
+}
+EXPORT_SYMBOL_GPL(pci_ep_doe_destroy);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5844deee2b5f..c4a0e25625e3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -692,6 +692,13 @@ struct pci_doe_feature {
u8 type;
};
+struct pci_epc;
+
+typedef void (*pci_ep_doe_complete_t)(struct pci_epc *epc, u8 func_no,
+ u16 cap_offset, int status,
+ u16 vendor, u8 type,
+ void *response_pl, size_t response_pl_sz);
+
#ifdef CONFIG_PCI_DOE
void pci_doe_init(struct pci_dev *pdev);
void pci_doe_destroy(struct pci_dev *pdev);
@@ -702,6 +709,38 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
#endif
+#ifdef CONFIG_PCI_ENDPOINT_DOE
+int pci_ep_doe_init(struct pci_epc *epc);
+int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no, u16 cap_offset);
+int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no, u16 cap_offset,
+ u16 vendor, u8 type, void *request,
+ size_t request_sz, pci_ep_doe_complete_t complete);
+int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no, u16 cap_offset);
+void pci_ep_doe_destroy(struct pci_epc *epc);
+#else
+static inline int pci_ep_doe_init(struct pci_epc *epc) { return -EOPNOTSUPP; }
+static inline int pci_ep_doe_add_mailbox(struct pci_epc *epc, u8 func_no,
+ u16 cap_offset)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int pci_ep_doe_process_request(struct pci_epc *epc, u8 func_no,
+ u16 cap_offset, u16 vendor, u8 type,
+ void *request, size_t request_sz,
+ pci_ep_doe_complete_t complete)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int pci_ep_doe_abort(struct pci_epc *epc, u8 func_no, u16 cap_offset)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void pci_ep_doe_destroy(struct pci_epc *epc) { }
+#endif
+
#ifdef CONFIG_PCI_NPEM
void pci_npem_create(struct pci_dev *dev);
void pci_npem_remove(struct pci_dev *dev);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
index abb9b7ae8029..c46e42f3ce78 100644
--- a/include/linux/pci-doe.h
+++ b/include/linux/pci-doe.h
@@ -22,6 +22,11 @@ struct pci_doe_mb;
/* Max data object length is 2^18 dwords */
#define PCI_DOE_MAX_LENGTH (1 << 18)
+typedef int (*pci_doe_protocol_handler_t)(const void *request,
+ size_t request_sz,
+ void **response,
+ size_t *response_sz);
+
struct pci_doe_mb *pci_find_doe_mailbox(struct pci_dev *pdev, u16 vendor,
u8 type);
diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
index 1eca1264815b..dd26294c8175 100644
--- a/include/linux/pci-epc.h
+++ b/include/linux/pci-epc.h
@@ -182,6 +182,9 @@ struct pci_epc {
unsigned long function_num_map;
int domain_nr;
bool init_complete;
+#ifdef CONFIG_PCI_ENDPOINT_DOE
+ struct xarray doe_mbs;
+#endif
};
/**
--
2.34.1
^ permalink raw reply related
* [PATCH v4 1/4] PCI/DOE: Move common definitions to the header file
From: Aksh Garg @ 2026-05-22 5:24 UTC (permalink / raw)
To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
skhan, lukas, cassel, alistair
Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk,
a-garg7
In-Reply-To: <20260522052434.802034-1-a-garg7@ti.com>
Move common macros and structures from drivers/pci/doe.c to
drivers/pci/pci.h to allow reuse across root complex and
endpoint DOE implementations.
PCI_DOE_MAX_LENGTH macro can be used outside the PCI core as well,
hence move the macro to include/linux/pci-doe.h.
These changes prepare the groundwork for the DOE endpoint implementation
that will reuse these common definitions.
Co-developed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Aksh Garg <a-garg7@ti.com>
---
Changes from v3 to v4:
- None.
Changes from v2 to v3:
- Rebased on 7.1-rc1.
Changes since v1:
- Moved the common macros that need not be visible outside the PCI core
to drivers/pci/pci.h instead to include/linux/pci-doe.h as suggested
by Lukas Wunner
- Removed the redundant empty inlines guarded with CONFIG_PCI_DOE in
include/linux/pci-doe.h.
v3: https://lore.kernel.org/all/20260427051725.223704-2-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-2-a-garg7@ti.com/
v1: https://lore.kernel.org/all/20260213123603.420941-3-a-garg7@ti.com/
drivers/pci/doe.c | 11 -----------
drivers/pci/pci.h | 9 +++++++++
include/linux/pci-doe.h | 3 +++
3 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 7b41da4ec11a..e8d9e95644b3 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -28,12 +28,6 @@
#define PCI_DOE_TIMEOUT HZ
#define PCI_DOE_POLL_INTERVAL (PCI_DOE_TIMEOUT / 128)
-#define PCI_DOE_FLAG_CANCEL 0
-#define PCI_DOE_FLAG_DEAD 1
-
-/* Max data object length is 2^18 dwords */
-#define PCI_DOE_MAX_LENGTH (1 << 18)
-
/**
* struct pci_doe_mb - State for a single DOE mailbox
*
@@ -63,11 +57,6 @@ struct pci_doe_mb {
#endif
};
-struct pci_doe_feature {
- u16 vid;
- u8 type;
-};
-
/**
* struct pci_doe_task - represents a single query/response
*
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e543a..5844deee2b5f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -683,6 +683,15 @@ struct pci_sriov {
bool drivers_autoprobe; /* Auto probing of VFs by driver */
};
+/* DOE Mailbox state flags */
+#define PCI_DOE_FLAG_CANCEL 0
+#define PCI_DOE_FLAG_DEAD 1
+
+struct pci_doe_feature {
+ u16 vid;
+ u8 type;
+};
+
#ifdef CONFIG_PCI_DOE
void pci_doe_init(struct pci_dev *pdev);
void pci_doe_destroy(struct pci_dev *pdev);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
index bd4346a7c4e7..abb9b7ae8029 100644
--- a/include/linux/pci-doe.h
+++ b/include/linux/pci-doe.h
@@ -19,6 +19,9 @@ struct pci_doe_mb;
#define PCI_DOE_FEATURE_CMA 1
#define PCI_DOE_FEATURE_SSESSION 2
+/* Max data object length is 2^18 dwords */
+#define PCI_DOE_MAX_LENGTH (1 << 18)
+
struct pci_doe_mb *pci_find_doe_mailbox(struct pci_dev *pdev, u16 vendor,
u8 type);
--
2.34.1
^ permalink raw reply related
* [PATCH v4 0/4] PCI: Add DOE support for endpoint
From: Aksh Garg @ 2026-05-22 5:24 UTC (permalink / raw)
To: linux-pci, linux-doc, mani, kwilczynski, bhelgaas, corbet, kishon,
skhan, lukas, cassel, alistair
Cc: linux-arm-kernel, linux-kernel, s-vadapalli, danishanwar, srk,
a-garg7
This patch series introduces the framework for supporting the Data
Object Exchange (DOE) feature for PCIe endpoint devices. Please refer
to the documentation added in patch 4 for details on the feature and
implementation architecture.
The implementation provides a common framework for all PCIe endpoint
controllers, not specific to any particular SoC vendor.
The changes since v1 are documented in the respective patch descriptions.
v3: https://lore.kernel.org/all/20260427051725.223704-1-a-garg7@ti.com/
v2: https://lore.kernel.org/all/20260401073022.215805-1-a-garg7@ti.com/
v1 (RFC): https://lore.kernel.org/all/20260213123603.420941-1-a-garg7@ti.com/
Below is a code demonstration showing the integration of DOE-EP APIs with
EPC drivers.
Note: The provided code is just to show how an EPC driver is expected to
utilize the pci_ep_doe_process_request() and pci_ep_doe_abort() APIs,
and might not cover all the corner cases. The below implementation
also expects the EPC hardware to have some memory buffer to store the
data from(for) write_mailbox(read_mailbox) DOE capability registers.
============================================================================
/* ========== DOE Completion Callback (invoked by DOE-EP core) ========== */
static void doe_completion_cb(struct pci_epc *epc, u8 func_no, u16 cap_offset,
int status, u16 vendor, u8 type,
void *response_pl, size_t response_pl_sz)
{
struct epc_driver *drv = epc_get_drvdata(epc);
u32 *response = (u32 *)response_pl;
u32 header1, header2;
int payload_dw, i;
if (status < 0) {
/* Error: set ERROR bit in DOE Status register */
writel(1 << DOE_STATUS_ERROR,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
goto free;
}
if (readl(drv->base + PF_DOE_CTRL_REG(func_no, cap_offset)) & DOE_CTRL_ABORT) {
/* Aborted: do not send response */
goto free;
}
/* Success: write DOE headers first, then response to the read memory */
/* Header 1: Vendor ID (bits 15:0) | Type (bits 23:16) */
header1 = (type << 16) | vendor;
writel(header1, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
/* Header 2: Length in DW (including 2 DW of headers + payload) */
payload_dw = DIV_ROUND_UP(response_pl_sz, sizeof(u32));
header2 = 2 + payload_dw; /* 2 header DWs + payload */
writel(header2, drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
/* Set READY bit to signal response ready */
writel(1 << DOE_STATUS_READY,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
/* Write response payload DWORDs to Read memory */
for (i = 0; i < payload_dw; i++)
writel(response[i],
drv->base + PF_DOE_RD_MEMORY_WR_REG(func_no, cap_offset));
/* Wait for the memory to empty before clearing the READY bit */
while (!RD_MEMORY_EMPTY()) {/* wait */}
writel(0 << DOE_STATUS_READY,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
/* unset BUSY bit */
writel(0 << DOE_STATUS_BUSY,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
free:
kfree(response_pl);
}
/* ========== DOE Interrupt Handler (triggered on GO bit from root complex) ========== */
static irqreturn_t doe_interrupt_handler(int irq, void *priv)
{
struct epc_driver *drv = priv;
u16 cap_offset = extract_cap_offset_from_irq(irq);
u8 func_no = extract_func_from_irq(irq);
u32 header1, header2, length_dw, *request;
u16 vendor;
u8 type;
int i, ret;
/* Read first header DWORD: Vendor ID (bits 15:0) | Type (bits 23:16) */
header1 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
vendor = header1 & 0xFFFF;
type = (header1 >> 16) & 0xFF;
/* Read second header DWORD: Length in DW (includes 2 DW of headers) */
header2 = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
length_dw = header2 & 0x3FFFF; /* Bits 17:0 */
if (!length_dw)
length_dw = PCI_DOE_MAX_LENGTH;
length_dw -= 2; /* Subtract 2 DW of headers to get payload length */
/* Allocate buffer for complete request (headers + payload) */
request = kzalloc(length_dw * sizeof(u32), GFP_ATOMIC);
if (!request) {
writel(1 << DOE_STATUS_ERROR,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
return IRQ_HANDLED;
}
/* Read remaining payload DWORDs from Write memory */
for (i = 0; i < length_dw; i++) {
while (WR_MEMORY_EMPTY()) { /* wait */ }
request[i] = readl(drv->base + PF_DOE_WR_MEMORY_RD_REG(func_no, cap_offset));
}
/* Set BUSY bit */
writel(1 << DOE_STATUS_BUSY,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
/* Hand off to DOE-EP core for asynchronous processing */
ret = pci_ep_doe_process_request(drv->epc, func_no, cap_offset,
vendor, type, (void *)request,
length_dw * sizeof(u32),
doe_completion_cb);
if (ret) {
writel(1 << DOE_STATUS_ERROR,
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
kfree(request);
}
return IRQ_HANDLED;
}
/* ========== Abort Handler (triggered on ABORT bit from root complex) ========== */
static irqreturn_t doe_abort_handler(int irq, void *priv)
{
struct epc_driver *drv = priv;
u16 cap_offset = extract_cap_offset_from_irq(irq);
u8 func_no = extract_func_from_irq(irq);
/* Abort pending/in-flight operations in DOE-EP core */
pci_ep_doe_abort(drv->epc, func_no, cap_offset);
/* Discard Write memory contents */
writel(DOE_WR_MEMORY_CTRL_DISCARD,
drv->base + PF_DOE_WR_MEMORY_CTRL_REG(func_no, cap_offset));
/* Clear status bits */
writel((0 << DOE_STATUS_ERROR) | (0 << DOE_STATUS_BUSY) |
(0 << DOE_STATUS_READY),
drv->base + PF_DOE_STATUS_REG(func_no, cap_offset));
return IRQ_HANDLED;
}
====================================================================================
Aksh Garg (4):
PCI/DOE: Move common definitions to the header file
PCI: endpoint: Add DOE mailbox support for endpoint functions
PCI: endpoint: Add support for DOE initialization and setup in EPC
core
Documentation: PCI: Add documentation for DOE endpoint support
Documentation/PCI/endpoint/index.rst | 1 +
.../PCI/endpoint/pci-endpoint-doe.rst | 329 +++++++++++
drivers/pci/doe.c | 11 -
drivers/pci/endpoint/Kconfig | 14 +
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/pci-ep-doe.c | 553 ++++++++++++++++++
drivers/pci/endpoint/pci-epc-core.c | 92 +++
drivers/pci/pci.h | 48 ++
include/linux/pci-doe.h | 8 +
include/linux/pci-epc.h | 9 +
10 files changed, 1055 insertions(+), 11 deletions(-)
create mode 100644 Documentation/PCI/endpoint/pci-endpoint-doe.rst
create mode 100644 drivers/pci/endpoint/pci-ep-doe.c
--
2.34.1
^ permalink raw reply
* Re: [PATCH v11 1/2] kunit: add tests for smp_cond_load_*_timeout()
From: Ankur Arora @ 2026-05-22 5:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
linux-next, bpf, arnd, catalin.marinas, will, peterz,
mark.rutland, harisokn, cl, ast, rafael, daniel.lezcano, memxor,
zhenglifeng1, xueshuai, rdunlap, david.laight.linux,
joao.m.martins, boris.ostrovsky, konrad.wilk, ashok.bhat,
Boqun Feng, Boqun Feng, Christoph Lameter, Ingo Molnar,
Konrad Dybcio, Mark Brown
In-Reply-To: <20260521144629.11ee1e2ff88c5387a589f54a@linux-foundation.org>
Andrew Morton <akpm@linux-foundation.org> writes:
> On Thu, 21 May 2026 01:30:37 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Add success and failure case tests for smp_cond_load_*_timeout().
>>
>> All of the test cases wait on some state in smp_cond_load_*_timeout().
>> In the success case we spawn a kthread that pokes the bit.
>>
>> Success or failure cases depend on the expected bit being set (or not).
>> Additionally in failure cases smp_cond_load_*_timeout() cannot return
>> before timeout.
>
> Your title "kunit: ..." implies that this is a kunit patchset. ie, and
> to my mind, this implies that the patchset makes changes to kunit
> infrastructure. This isn't the case, of course.
Aah yes. I did miss that.
>> ...
>>
>> +config BARRIER_TIMEOUT_TEST
>> + tristate "KUnit tests for smp_cond_load_relaxed_timeout()"
>
> The patchset which attempted to add smp_cond_load_relaxed_timeout() was
> titled "barrier: ...", which is a more appropriate identifier for this
> patchset.
>
> Anyway, I removed the series "barrier: Add
> smp_cond_load_{relaxed,acquire}_timeout()" from mm.git on May 19 for
> forgotten reasons. So I suggest that the series be respun and resent,
> with these two test patches included.
Will do.
> Again, I'm really not the guy to be handling these patches. But a 4-6%
> performance improvement in real world workloads is not to be ignored.
>
> Not by me, anyway.
:).
Thanks
--
ankur
^ permalink raw reply
* Re: [PATCH 02/10] [v3] input: gpio-keys: make legacy gpiolib optional
From: Matti Vaittinen @ 2026-05-22 4:55 UTC (permalink / raw)
To: Arnd Bergmann, linux-gpio
Cc: linux-kernel, Arnd Bergmann, Christian Lamparter, Johannes Berg,
Aaro Koskinen, Andreas Kemnade, Kevin Hilman, Roger Quadros,
Tony Lindgren, Thomas Bogendoerfer, John Paul Adrian Glaubitz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Linus Walleij, Bartosz Golaszewski,
Dmitry Torokhov, Lee Jones, Pavel Machek, Florian Fainelli,
Jonas Gorski, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-wireless,
linux-omap, linux-arm-kernel, linux-mips, linux-sh, linux-input,
linux-leds, netdev
In-Reply-To: <20260520183815.2510387-3-arnd@kernel.org>
On 20/05/2026 21:38, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> Most users of gpio-keys and gpio-keys-polled use modern gpiolib
> interfaces, but there are still number of ancient sh, arm32 and x86
> machines that have never been converted.
>
> Add an #ifdef block for the parts of the driver that are only used on
> those legacy machines.
>
> The two Rohm PMIC drivers use a gpio-keys device without an actual GPIO,
> passing an IRQ number instead. In order to keep this working both with
> and with CONFIG_GPIOLIB_LEGACY, change the gpio-keys driver to ignore
> the gpio number if an IRQ is passed.
>
> Link: https://lore.kernel.org/all/b3c94552-c104-42e3-be15-7e8362e8039e@gmail.com/
> Link: https://lore.kernel.org/all/afJXG4_rtaj3l2Dk@google.com/
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> v3: resend
> v2: skip the fake GPIO number passing from mfd
>
> The removal of the arm platforms using this is not yet going to happen
> for 7.2, and Dmitry's changes for the Rohm drivers have not yet
> made it into linux-next as of 2026-05-20, so for the moment I
> would still like to see this patch get merged, even if we are
> closing in on completely removing the legacy gpio support in
> the gpio_keys driver, so we can make CONFIG_GPIOLIB_LEGACY
> default-disabled sooner.
I am (still) all fine with this, even though I like Dmitry's set. I
suppose you already have a plan for merging this, but I still have to
ask - why the MFD changes aren't in own patch? I feel it would have
simplified merging, backporting, reviewing and reverting if needed.
Well, other than that:
Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com>
Yours,
-- Matti
---
Matti Vaittinen
Linux kernel developer at ROHM Semiconductors
Oulu Finland
~~ When things go utterly wrong vim users can always type :help! ~~
^ permalink raw reply
* Re: [PATCH] arm64: tlb: Flush walk cache when unsharing PMD tables
From: Zeng Heng @ 2026-05-22 4:43 UTC (permalink / raw)
To: Catalin Marinas
Cc: will, akpm, npiggin, aneesh.kumar, peterz, linux-kernel,
wangkefeng.wang, linux-arm-kernel, linux-mm, linux-arch,
David Hildenbrand
In-Reply-To: <ag8fHYL-S26uO0yZ@arm.com>
Hi Catalin,
On 2026/5/21 23:05, Catalin Marinas wrote:
> + David H.
>
> On Thu, May 21, 2026 at 03:30:11PM +0800, Zeng Heng wrote:
>> From: Zeng Heng <zengheng4@huawei.com>
>>
>> When huge_pmd_unshare() is called to unshare a PMD table, the
>> tlb_unshare_pmd_ptdesc() function sets tlb->unshared_tables=true
>> but the aarch64 tlb_flush() only checked tlb->freed_tables to
>> determine whether to use TLBF_NONE (vae1is, invalidates walk
>> cache) or TLBF_NOWALKCACHE (vale1is, leaf-only).
>>
>> This caused the stale PMD page table entry to remain in the walk cache
>> after unshare, potentially leading to incorrect page table walks.
>>
>> Fix by including unshared_tables in the check, so that when
>> unsharing tables, TLBF_NONE is used and the walk cache is properly
>> invalidated.
>>
>> Here is the detailed distinction between vae1is and vale1is:
>>
>> | Instruction Combination | Actual Invalidation Scope |
>> | ------------------------ | --------------------------------------------------|
>> | `VAE1IS` + TTL=`0` | All entries at all levels (full invalidation) |
>> | `VAE1IS` + TTL=`2` (L2) | Non-leaf at Level 0/1 + leaf at Level 2 |
>> | `VALE1IS` + TTL=`0` | Leaf entries at all levels (non-leaf not cleared) |
>> | `VALE1IS` + TTL=`2` (L2) | Leaf entry at Level 2 only |
>>
>> Signed-off-by: Zeng Heng <zengheng4@huawei.com>
> The fix looks fine but does it need:
>
> Fixes: 8ce720d5bd91 ("mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather")
> Cc: <stable@vger.kernel.org>
It makes sense to me. Thanks a lot for that.
Best regards,
Zeng Heng
>> ---
>> arch/arm64/include/asm/tlb.h | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
>> index 10869d7731b8..751bd57bc3ba 100644
>> --- a/arch/arm64/include/asm/tlb.h
>> +++ b/arch/arm64/include/asm/tlb.h
>> @@ -53,7 +53,8 @@ static inline int tlb_get_level(struct mmu_gather *tlb)
>> static inline void tlb_flush(struct mmu_gather *tlb)
>> {
>> struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
>> - tlbf_t flags = tlb->freed_tables ? TLBF_NONE : TLBF_NOWALKCACHE;
>> + tlbf_t flags = (tlb->freed_tables || tlb->unshared_tables) ?
>> + TLBF_NONE : TLBF_NOWALKCACHE;
>> unsigned long stride = tlb_get_unmap_size(tlb);
>> int tlb_level = tlb_get_level(tlb);
>>
>> --
>> 2.43.0
^ permalink raw reply
* [PATCH v5 20/20] swiotlb: Preserve allocation virtual address for dynamic pools
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
swiotlb_alloc_tlb() can allocate from the DMA atomic pool when a decrypted
pool is needed from atomic context. With CONFIG_DMA_DIRECT_REMAP, the
atomic pool is backed by remapped virtual addresses, which are not the same
as the direct-map addresses returned by phys_to_virt().
swiotlb_init_io_tlb_pool() currently reconstructs the pool virtual address
from the physical start address. For atomic-pool backed allocations this
stores the wrong address in pool->vaddr. Later, swiotlb_free_tlb() passes
that address to dma_free_from_pool(), which will fail to recognize the
chunk
Pass the virtual address returned by the allocation path into
swiotlb_init_io_tlb_pool(), and store that address in pool->vaddr. This
keeps the pool free path using the same virtual address as the allocator.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
kernel/dma/swiotlb.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 14d834ca298b..e4bd8c9eaeda 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -302,9 +302,9 @@ void __init swiotlb_update_mem_attributes(void)
}
static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
- unsigned long nslabs, bool late_alloc, unsigned int nareas)
+ void *vaddr, unsigned long nslabs, bool late_alloc,
+ unsigned int nareas)
{
- void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
mem->nslabs = nslabs;
@@ -445,7 +445,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
return;
}
- swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false, nareas);
+ swiotlb_init_io_tlb_pool(mem, __pa(tlb), tlb, nslabs, false, nareas);
add_mem_pool(&io_tlb_default_mem, mem);
if (flags & SWIOTLB_VERBOSE)
@@ -553,7 +553,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
}
}
- swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
+ swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), vstart, nslabs, true,
nareas);
add_mem_pool(&io_tlb_default_mem, mem);
@@ -664,25 +664,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
* @phys_limit: Maximum allowed physical address of the buffer.
* @attrs: DMA attributes for the allocation.
* @gfp: GFP flags for the allocation.
+ * @vaddr: Receives the virtual address for the allocated buffer.
*
* Return: Allocated pages, or %NULL on allocation failure.
*/
static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
- u64 phys_limit, unsigned long attrs, gfp_t gfp)
+ u64 phys_limit, unsigned long attrs, gfp_t gfp, void **vaddr)
{
struct page *page;
+ *vaddr = NULL;
+
/*
* Allocate from the atomic pools if memory is encrypted and
* the allocation is atomic, because decrypting may block.
*/
if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
- void *vaddr;
-
if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
return NULL;
- return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
+ return dma_alloc_from_pool(dev, bytes, vaddr, gfp,
attrs, dma_coherent_ok);
}
@@ -705,6 +706,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
return NULL;
}
+ if (page)
+ *vaddr = phys_to_virt(page_to_phys(page));
return page;
}
@@ -750,6 +753,7 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
{
struct io_tlb_pool *pool;
unsigned int slot_order;
+ void *tlb_vaddr;
struct page *tlb;
size_t pool_size;
size_t tlb_size;
@@ -767,7 +771,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
tlb_size = nslabs << IO_TLB_SHIFT;
- while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
+ while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp,
+ &tlb_vaddr))) {
if (nslabs <= minslabs)
goto error_tlb;
nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
@@ -781,12 +786,12 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
if (!pool->slots)
goto error_slots;
- swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, nareas);
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), tlb_vaddr, nslabs,
+ true, nareas);
return pool;
error_slots:
- swiotlb_free_tlb(page_address(tlb), tlb_size,
- !!(attrs & DMA_ATTR_CC_SHARED));
+ swiotlb_free_tlb(tlb_vaddr, tlb_size, !!(attrs & DMA_ATTR_CC_SHARED));
error_tlb:
kfree(pool);
error:
@@ -1995,7 +2000,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
mem->unencrypted = false;
}
- swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
+ swiotlb_init_io_tlb_pool(pool, rmem->base, phys_to_virt(rmem->base),
+ nslabs,
false, nareas);
mem->force_bounce = true;
mem->for_alloc = true;
--
2.43.0
^ permalink raw reply related
* [PATCH v5 19/20] dma: free atomic pool pages by physical address
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
dma_direct_alloc_pages() may satisfy atomic allocations from the coherent
atomic pools. The pool allocation is keyed by the virtual address stored in
the gen_pool, but the pages API returns only the backing struct page.
On architectures with CONFIG_DMA_DIRECT_REMAP, atomic pool chunks are added
to the gen_pool using their remapped virtual address.
dma_direct_free_pages() reconstructs a linear-map address with
page_address(page) and passes that to dma_free_from_pool(). That address
does not match the gen_pool virtual range, so the pool lookup can fail and
the code can fall through to freeing a pool-owned page through the normal
page allocator path.
Add a page-based pool free helper that looks up the owning pool chunk by
physical address, translates it back to the gen_pool virtual address, and
frees that address to the pool. Use it from dma_direct_free_pages() while
keeping the existing virtual-address helper for coherent allocation frees.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
include/linux/dma-map-ops.h | 1 +
kernel/dma/direct.c | 4 +--
kernel/dma/pool.c | 54 +++++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 696b2c3a2305..8be059e69935 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -215,6 +215,7 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
void **cpu_addr, gfp_t flags, unsigned long attrs,
bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
bool dma_free_from_pool(struct device *dev, void *start, size_t size);
+bool dma_free_from_pool_page(struct device *dev, struct page *page, size_t size);
int dma_direct_set_offset(struct device *dev, phys_addr_t cpu_start,
dma_addr_t dma_start, u64 size);
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 907c6084c616..488d53ed21f3 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -488,9 +488,9 @@ void dma_direct_free_pages(struct device *dev, size_t size,
*/
bool mark_mem_encrypted = force_dma_unencrypted(dev);
- /* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
+ /* If page is not from an atomic pool, dma_free_from_pool_page() fails */
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
- dma_free_from_pool(dev, vaddr, size))
+ dma_free_from_pool_page(dev, page, size))
return;
phys = page_to_phys(page);
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index e7df8d279e75..43b8101d860f 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -356,3 +356,57 @@ bool dma_free_from_pool(struct device *dev, void *start, size_t size)
return false;
}
+
+struct dma_pool_phys_match {
+ phys_addr_t phys;
+ size_t size;
+ unsigned long addr;
+ bool found;
+};
+
+static void dma_pool_find_phys(struct gen_pool *pool, struct gen_pool_chunk *chunk,
+ void *data)
+{
+ struct dma_pool_phys_match *match = data;
+ phys_addr_t end = match->phys + match->size - 1;
+ phys_addr_t chunk_end;
+
+ if (match->found)
+ return;
+
+ chunk_end = chunk->phys_addr + (chunk->end_addr - chunk->start_addr);
+ if (match->phys < chunk->phys_addr || end > chunk_end)
+ return;
+
+ match->addr = chunk->start_addr + (match->phys - chunk->phys_addr);
+ match->found = true;
+}
+
+static bool dma_free_from_pool_phys(struct dma_gen_pool *dma_pool, phys_addr_t phys,
+ size_t size)
+{
+ struct dma_pool_phys_match match = {
+ .phys = phys,
+ .size = size,
+ };
+
+ gen_pool_for_each_chunk(dma_pool->pool, dma_pool_find_phys, &match);
+ if (!match.found)
+ return false;
+
+ gen_pool_free(dma_pool->pool, match.addr, size);
+ return true;
+}
+
+bool dma_free_from_pool_page(struct device *dev, struct page *page, size_t size)
+{
+ struct dma_gen_pool *dma_pool = NULL;
+ phys_addr_t phys = page_to_phys(page);
+
+ while ((dma_pool = dma_guess_pool(dma_pool, 0))) {
+ if (dma_free_from_pool_phys(dma_pool, phys, size))
+ return true;
+ }
+
+ return false;
+}
--
2.43.0
^ permalink raw reply related
* [PATCH v5 18/20] dma: swiotlb: handle set_memory_decrypted() failures
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
Check the return value when converting swiotlb pools between encrypted and
decrypted mappings. If the default pool cannot be decrypted after early
initialization, mark the pool fully used so it cannot satisfy future bounce
allocations.
For late initialization, return the `set_memory_decrypted()` failure. For
restricted DMA pools, fail device initialization if the reserved pool
cannot be decrypted.
This prevents swiotlb from using pools whose encryption attributes do not
match their metadata, and avoids returning pages with uncertain encryption
state back to the allocator.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
kernel/dma/swiotlb.c | 80 +++++++++++++++++++++++++++++++++++---------
1 file changed, 65 insertions(+), 15 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4c56f64602ea..14d834ca298b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -248,6 +248,23 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
}
+static void swiotlb_mark_pool_used(struct io_tlb_pool *pool)
+{
+ unsigned long i;
+
+ for (i = 0; i < pool->nareas; i++) {
+ pool->areas[i].index = 0;
+ pool->areas[i].used = pool->area_nslabs;
+ }
+
+ for (i = 0; i < pool->nslabs; i++) {
+ pool->slots[i].list = 0;
+ pool->slots[i].orig_addr = INVALID_PHYS_ADDR;
+ pool->slots[i].alloc_size = 0;
+ pool->slots[i].pad_slots = 0;
+ }
+}
+
/*
* Early SWIOTLB allocation may be too early to allow an architecture to
* perform the desired operations. This function allows the architecture to
@@ -272,8 +289,16 @@ void __init swiotlb_update_mem_attributes(void)
return;
bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
- if (io_tlb_default_mem.unencrypted)
- set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ int ret;
+
+ ret = set_memory_decrypted((unsigned long)mem->vaddr,
+ bytes >> PAGE_SHIFT);
+ if (ret) {
+ pr_warn("Failed to decrypt default memory pool, disabling it\n");
+ swiotlb_mark_pool_used(mem);
+ }
+ }
}
static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
@@ -442,9 +467,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
{
struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+ unsigned int order, area_order, slot_order;
+ bool leak_pages = false;
unsigned int nareas;
unsigned char *vstart = NULL;
- unsigned int order, area_order;
bool retried = false;
int rc = 0;
@@ -504,6 +530,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
(PAGE_SIZE << order) >> 20);
}
+ rc = -ENOMEM;
nareas = limit_nareas(default_nareas, nslabs);
area_order = get_order(array_size(sizeof(*mem->areas), nareas));
mem->areas = (struct io_tlb_area *)
@@ -511,14 +538,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (!mem->areas)
goto error_area;
+ slot_order = get_order(array_size(sizeof(*mem->slots), nslabs));
mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(array_size(sizeof(*mem->slots), nslabs)));
+ slot_order);
if (!mem->slots)
goto error_slots;
- if (io_tlb_default_mem.unencrypted)
- set_memory_decrypted((unsigned long)vstart,
- (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ rc = set_memory_decrypted((unsigned long)vstart,
+ (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
+ if (rc) {
+ leak_pages = true;
+ goto error_decrypt;
+ }
+ }
swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
nareas);
@@ -527,16 +560,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
swiotlb_print_info();
return 0;
+error_decrypt:
+ free_pages((unsigned long)mem->slots, slot_order);
error_slots:
free_pages((unsigned long)mem->areas, area_order);
error_area:
- free_pages((unsigned long)vstart, order);
- return -ENOMEM;
+ if (!leak_pages)
+ free_pages((unsigned long)vstart, order);
+ return rc;
}
void __init swiotlb_exit(void)
{
struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
+ bool leak_pages = false;
unsigned long tbl_vaddr;
size_t tbl_size, slots_size;
unsigned int area_order;
@@ -552,19 +589,23 @@ void __init swiotlb_exit(void)
tbl_size = PAGE_ALIGN(mem->end - mem->start);
slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
- if (io_tlb_default_mem.unencrypted)
- set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ if (set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT))
+ leak_pages = true;
+ }
if (mem->late_alloc) {
area_order = get_order(array_size(sizeof(*mem->areas),
mem->nareas));
free_pages((unsigned long)mem->areas, area_order);
- free_pages(tbl_vaddr, get_order(tbl_size));
+ if (!leak_pages)
+ free_pages(tbl_vaddr, get_order(tbl_size));
free_pages((unsigned long)mem->slots, get_order(slots_size));
} else {
memblock_free(mem->areas,
array_size(sizeof(*mem->areas), mem->nareas));
- memblock_phys_free(mem->start, tbl_size);
+ if (!leak_pages)
+ memblock_phys_free(mem->start, tbl_size);
memblock_free(mem->slots, slots_size);
}
@@ -1938,9 +1979,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
* restricted mem pool is decrypted by default
*/
if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ int ret;
+
mem->unencrypted = true;
- set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
- rmem->size >> PAGE_SHIFT);
+ ret = set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
+ rmem->size >> PAGE_SHIFT);
+ if (ret) {
+ dev_err(dev, "Failed to decrypt restricted DMA pool\n");
+ kfree(pool->areas);
+ kfree(pool->slots);
+ kfree(mem);
+ return ret;
+ }
} else {
mem->unencrypted = false;
}
--
2.43.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox