Linux-ARM-Kernel Archive on lore.kernel.org

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Sam Edwards @ 2026-04-15 17:38 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Jakub Kicinski, Andrew Lunn, Alexandre Torgue, Andrew Lunn,
	David S. Miller, Eric Dumazet,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	linux-stm32, Linux Network Development Mailing List, Paolo Abeni
In-Reply-To: <ad-ID2WaPgPJqdsa@shell.armlinux.org.uk>

On Wed, Apr 15, 2026 at 5:44 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Apr 14, 2026 at 07:12:34PM -0700, Sam Edwards wrote:
> > On Tue, Apr 14, 2026 at 6:19 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > > Okay, just a quick note to say that nvidia's 5.10.216-tegra kernel
> > > survives iperf3 -c -R to the imx6.
> >
> > Hi Russell,
> >
> > Aw, you beat me to it! I was about to report that 5.10.104-tegra is
> > unaffected. And my iperf3 server is a multi-GbE amd64 machine.
> >
> > > Dumping the registers and comparing, and then forcing the RQS and TQS
> > > values to 0x23 (+1 = 36, *256 = 9216 bytes) and 0x8f (+1 = 144,
> > > *256 = 36864 ytes) respectively seems to solve the problem. Under
> > > net-next, these both end up being 0xff (+1 = 256, *256 = 65536 bytes.)
> > > Suspiciously, 36 * 4 = 144, and I also see that this kernel programs
> > > all four of the MTL receive operation mode registers, but only the
> > > first MTL transmit operation mode register. However, DMA channels 1-3
> > > aren't initialised.
> >
> > Wow, great! I wonder if the problem is that the MTL FIFOs are smaller
> > than that, so when the DMA suffers a momentary hiccup, the FIFOs are
> > allowed to overflow, putting the hardware in a bad state.
> >
> > Though I suspect this is only half of the problem: do you still see
> > RBUs? Everything you've shared so far suggests the DMA failures are
> > _not_ because the rx ring is drying up.
>
> Yes. Note that RBUs will happen not because of DMA failures, but if
> the kernel fails to keep up with the packet rate. RBU means "we read
> the next descriptor, and it wasn't owned by hardware".

Are you speaking from observation, documentation, or understanding?
I'd define RBU the same way, but you reported:

```
[   55.766199] dwc-eth-dwmac 2490000.ethernet eth0: q0: receive buffer
unavailable: cur_rx=309 dirty_rx=309 last_cur_rx=245
last_cur_rx_post=309 last_dirty_rx=245 count=64 budget=64

cur_rx == dirty_rx _should_ mean that we fully refilled the ring. [...]
[...]
Every ring entry contains the same RDES3 value, so it really is
completely full at the point RBU fires (bit 31 clear means software
owns the descriptor, and it's basically saying first/last segment,
RDES1 valid, buffer 1 length of 1518.
```

It would seem* that the kernel isn't really failing to keep up with
the packet rate. If RBU is firing with a ring that's not even close to
empty, that tells me there's another way for it to fire. So I suspect
the hardware designers implemented it to mean:
"We couldn't read the next descriptor, _or_ it wasn't owned by hardware."

(* However, if bit 31 is clear everywhere, wouldn't that mean the ring
is actually completely depleted, not full? If count==budget, wouldn't
that mean the whole ring hasn't been visited, so we only refilled 64
entries and not necessarily the entire ring? Maybe the kernel isn't
keeping up after all.)

> That has:
>
>         const nveu32_t rx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
>                 { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
>                   FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
>                 { FIFO_SZ(36U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U),
>                   FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(2U), FIFO_SZ(16U) },
>         };
>         const nveu32_t tx_fifo_sz[2U][OSI_EQOS_MAX_NUM_QUEUES] = {
>                 { FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U), FIFO_SZ(9U),
>                   FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U), FIFO_SZ(1U) },
>                 { FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U),
>                   FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U), FIFO_SZ(8U) },
>         };
>
> where each of those values is the RQS/TQS value to use in KiB:
>
> #define FIFO_SZ(x)              ((((x) * 1024U) / 256U) - 1U)
>
> This doesn't correspond with the values I'm seeing programmed into
> the hardware under the 5.10.216-tegra kernel. I'm seeing TQS = 143
> (36KiB), and RQS = 35 (9KiB). Yes, these values exist in the tables
> above from a quick look, but they're not in the right place!

True, but:
a) I doubt 5.10.216-tegra includes exactly the same version of the
driver found in this random GitHub mirror. (My intent was only to
point out that they don't use 5.10's stmmac; I should have been more
clear that I wasn't trying to link the same version, sorry!)
b) This is vendor code; I don't know how good their testing/review
process is. It might not run the way it looks. The intent seems to be
for RQS > TQS (which makes intuitive sense), but as you're seeing the
registers programmed the other way 'round, they might have gotten them
subtly mixed up.

> Now, as for FIFO sizes, if we sum up all the entries, then we
> get:
>
> SUM(rx_fifo_size[0][]) = 60KiB
> SUM(rx_fifo_size[1][]) = 64KiB
> SUM(tx_fifo_size[0][]) = 60KiB
> SUM(tx_fifo_size[1][]) = 64KiB

I follow the math with 64KiB, but surely the 60KiB should be
9+9+9+9+1+1+1+1=40KiB? This seems to me that the "legacy EQOS" simply
shifts with smaller FIFOs. Since dwmac is licensed as a soft IP core,
perhaps the FIFO size is an elaboration parameter? That would mean
this isn't an issue with dwmac 5.0 broadly, but with Nvidia's specific
instantiation of it.

^ permalink raw reply

* Re: [PATCH] arm: fix race condition on PG_dcache_clean in __sync_icache_dcache()
From: Will Deacon @ 2026-04-15 17:15 UTC (permalink / raw)
  To: Brian Ruley; +Cc: Russell King, catalin.marinas, linux-arm-kernel, linux-kernel
In-Reply-To: <ad_F0YXDx2g_Zb7R@willie-the-truck>

On Wed, Apr 15, 2026 at 06:07:34PM +0100, Will Deacon wrote:
> On Tue, Apr 14, 2026 at 02:39:18PM +0300, Brian Ruley wrote:
> > This bug was already discovered and fixed for arm64 in
> > commit 588a513d3425 ("arm64: Fix race condition on PG_dcache_clean in
> > __sync_icache_dcache()").
> > 
> > Verified with added instrumentation to track dcache flushes in a ring
> > buffer, as shown by the (distilled) output:
> > 
> >   kernel: SIGILL at b6b80ac0 cpu 1 pid 32663 linux_pte=8eff659f
> >           hw_pte=8eff6e7e young=1 exec=1
> >   kernel: dcache flush START   cpu0 pfn=8eff6 ts=48629557020154
> >   kernel: dcache flush SKIPPED cpu1 pfn=8eff6 ts=48629557020154
> >   kernel: dcache flush FINISH  cpu0 pfn=8eff6 ts=48629557036154
> >   audisp-syslog: comm="journalctl" exe="/usr/bin/journalctl" sig=4 [...]
> > 
> > Discussions in the mailing list mentioned that arch/arm is also affected
> > but the fix was never applied to it [1][2]. Apply the change now, since
> > the race condition can cause sporadic SIGILL's and SEGV's especially
> > while under high memory pressure.
> > 
> > Link: https://lore.kernel.org/all/adzMOdySgMIePcue@willie-the-truck [1]
> > Link: https://lore.kernel.org/all/20210514095001.13236-1-catalin.marinas@arm.com [2]
> > Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
> > ---
> >  arch/arm/mm/flush.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> > index 19470d938b23..4d7ef5cc36b6 100644
> > --- a/arch/arm/mm/flush.c
> > +++ b/arch/arm/mm/flush.c
> > @@ -304,8 +304,10 @@ void __sync_icache_dcache(pte_t pteval)
> >  	else
> >  		mapping = NULL;
> >  
> > -	if (!test_and_set_bit(PG_dcache_clean, &folio->flags.f))
> > +	if (!test_bit(PG_dcache_clean, &folio->flags.f)) {
> >  		__flush_dcache_folio(mapping, folio);
> > +		set_bit(PG_dcache_clean, &folio->flags.f);
> > +	}
> >  
> >  	if (pte_exec(pteval))
> >  		__flush_icache_all();
> 
> Thanks, Brian:
> 
> Reviewed-by: Will Deacon <will@kernel.org>

I've uploaded this to the patch tracker [1] as 9472/1.

Will

[1] https://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=9472/1


^ permalink raw reply

* Re: [PATCH] arm: fix race condition on PG_dcache_clean in __sync_icache_dcache()
From: Will Deacon @ 2026-04-15 17:07 UTC (permalink / raw)
  To: Brian Ruley; +Cc: Russell King, catalin.marinas, linux-arm-kernel, linux-kernel
In-Reply-To: <20260414113919.782-1-brian.ruley@gehealthcare.com>

On Tue, Apr 14, 2026 at 02:39:18PM +0300, Brian Ruley wrote:
> This bug was already discovered and fixed for arm64 in
> commit 588a513d3425 ("arm64: Fix race condition on PG_dcache_clean in
> __sync_icache_dcache()").
> 
> Verified with added instrumentation to track dcache flushes in a ring
> buffer, as shown by the (distilled) output:
> 
>   kernel: SIGILL at b6b80ac0 cpu 1 pid 32663 linux_pte=8eff659f
>           hw_pte=8eff6e7e young=1 exec=1
>   kernel: dcache flush START   cpu0 pfn=8eff6 ts=48629557020154
>   kernel: dcache flush SKIPPED cpu1 pfn=8eff6 ts=48629557020154
>   kernel: dcache flush FINISH  cpu0 pfn=8eff6 ts=48629557036154
>   audisp-syslog: comm="journalctl" exe="/usr/bin/journalctl" sig=4 [...]
> 
> Discussions in the mailing list mentioned that arch/arm is also affected
> but the fix was never applied to it [1][2]. Apply the change now, since
> the race condition can cause sporadic SIGILL's and SEGV's especially
> while under high memory pressure.
> 
> Link: https://lore.kernel.org/all/adzMOdySgMIePcue@willie-the-truck [1]
> Link: https://lore.kernel.org/all/20210514095001.13236-1-catalin.marinas@arm.com [2]
> Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
> ---
>  arch/arm/mm/flush.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index 19470d938b23..4d7ef5cc36b6 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -304,8 +304,10 @@ void __sync_icache_dcache(pte_t pteval)
>  	else
>  		mapping = NULL;
>  
> -	if (!test_and_set_bit(PG_dcache_clean, &folio->flags.f))
> +	if (!test_bit(PG_dcache_clean, &folio->flags.f)) {
>  		__flush_dcache_folio(mapping, folio);
> +		set_bit(PG_dcache_clean, &folio->flags.f);
> +	}
>  
>  	if (pte_exec(pteval))
>  		__flush_icache_all();

Thanks, Brian:

Reviewed-by: Will Deacon <will@kernel.org>

Will


^ permalink raw reply

* Re: [PATCH] arm64: dts: st: Fix SAI addresses on stm32mp251
From: Olivier MOYSAN @ 2026-04-15 17:00 UTC (permalink / raw)
  To: Marek Vasut, linux-arm-kernel
  Cc: Alexandre Torgue, Conor Dooley, Krzysztof Kozlowski,
	Maxime Coquelin, Rob Herring, devicetree, linux-kernel,
	linux-stm32
In-Reply-To: <20260411130300.19603-1-marex@nabladev.com>

Hi Marek,

On 4/11/26 15:02, Marek Vasut wrote:
> The second field of SAI register addresses should be within 0x3f0 bytes
> from the start of the SAI register addresses, the second field describes
> the ID registers which are at that addrses. Currently, the second field
> does not match RM, fix it.
> 
> Fixes: bf26d75a95f1 ("arm64: dts: st: add sai support on stm32mp251")
> Signed-off-by: Marek Vasut <marex@nabladev.com>
> ---
> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
> Cc: Conor Dooley <conor+dt@kernel.org>
> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> Cc: Olivier Moysan <olivier.moysan@foss.st.com>
> Cc: Rob Herring <robh@kernel.org>
> Cc: devicetree@vger.kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-stm32@st-md-mailman.stormreply.com
> ---
>   arch/arm64/boot/dts/st/stm32mp251.dtsi | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/st/stm32mp251.dtsi b/arch/arm64/boot/dts/st/stm32mp251.dtsi
> index 673fbc5632e69..9c63fdb5a885a 100644
> --- a/arch/arm64/boot/dts/st/stm32mp251.dtsi
> +++ b/arch/arm64/boot/dts/st/stm32mp251.dtsi
> @@ -1202,7 +1202,7 @@ spi5: spi@40280000 {
>   
>   			sai1: sai@40290000 {
>   				compatible = "st,stm32mp25-sai";
> -				reg = <0x40290000 0x4>, <0x4029a3f0 0x10>;
> +				reg = <0x40290000 0x4>, <0x402903f0 0x10>;
>   				ranges = <0 0x40290000 0x400>;
>   				#address-cells = <1>;
>   				#size-cells = <1>;
> @@ -1236,7 +1236,7 @@ sai1b: audio-controller@40290024 {
>   
>   			sai2: sai@402a0000 {
>   				compatible = "st,stm32mp25-sai";
> -				reg = <0x402a0000 0x4>, <0x402aa3f0 0x10>;
> +				reg = <0x402a0000 0x4>, <0x402a03f0 0x10>;
>   				ranges = <0 0x402a0000 0x400>;
>   				#address-cells = <1>;
>   				#size-cells = <1>;
> @@ -1270,7 +1270,7 @@ sai2b: audio-controller@402a0024 {
>   
>   			sai3: sai@402b0000 {
>   				compatible = "st,stm32mp25-sai";
> -				reg = <0x402b0000 0x4>, <0x402ba3f0 0x10>;
> +				reg = <0x402b0000 0x4>, <0x402b03f0 0x10>;
>   				ranges = <0 0x402b0000 0x400>;
>   				#address-cells = <1>;
>   				#size-cells = <1>;
> @@ -1362,7 +1362,7 @@ usart1: serial@40330000 {
>   
>   			sai4: sai@40340000 {
>   				compatible = "st,stm32mp25-sai";
> -				reg = <0x40340000 0x4>, <0x4034a3f0 0x10>;
> +				reg = <0x40340000 0x4>, <0x403403f0 0x10>;
>   				ranges = <0 0x40340000 0x400>;
>   				#address-cells = <1>;
>   				#size-cells = <1>;

Reviewed-by: Olivier Moysan <olivier.moysan@foss.st.com>

Thanks for your patch
BRs
Olivier


^ permalink raw reply

* Re: [PATCH net v2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-15 16:58 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260415164643.GQ772670@horms.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5373 bytes --]

On Apr 15, Simon Horman wrote:
> On Tue, Apr 14, 2026 at 08:50:52AM +0200, Lorenzo Bianconi wrote:
> > Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in
> > airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to
> > TX_CPU_IDX to notify the NIC the QDMA TX ring is empty.
> > 
> > Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > Changes in v2:
> > - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> >   order to avoid any possible NULL pointer dereference in
> >   airoha_qdma_cleanup_tx_queue()
> 
> This seems to be a separate issue.
> If so, I think it should be split out into a separate patch.
> 
> > - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> > - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
> 
> I think it was covered in the review Jakub forwarded for v1.  But FTR,
> Sashiko has some feedback on this patch in the form of an existing bug
> (that should almost certainly be handled separately from this patch).

Hi Simon,

I took a look to the Sashiko's report [0] but this issue is not introduced by
this patch and, even if it would be a better approach, I guess the hw is
capable of managing out-of-order TX descriptors. So I guess this patch is fine
in this way, agree?

[0] https://sashiko.dev/#/patchset/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022%40kernel.org

> 
> > ---
> >  drivers/net/ethernet/airoha/airoha_eth.c | 41 ++++++++++++++++++++++++++------
> >  1 file changed, 34 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 9e995094c32a..3c1a2bc68c42 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -966,27 +966,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
> >  	dma_addr_t dma_addr;
> >  
> >  	spin_lock_init(&q->lock);
> > -	q->ndesc = size;
> >  	q->qdma = qdma;
> >  	q->free_thr = 1 + MAX_SKB_FRAGS;
> >  	INIT_LIST_HEAD(&q->tx_list);
> >  
> > -	q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
> > +	q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
> >  				GFP_KERNEL);
> >  	if (!q->entry)
> >  		return -ENOMEM;
> >  
> > -	q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
> > +	q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
> >  				      &dma_addr, GFP_KERNEL);
> >  	if (!q->desc)
> >  		return -ENOMEM;
> >  
> > -	for (i = 0; i < q->ndesc; i++) {
> > +	for (i = 0; i < size; i++) {
> >  		u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
> >  
> >  		list_add_tail(&q->entry[i].list, &q->tx_list);
> >  		WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
> >  	}
> > +	q->ndesc = size;
> >  
> >  	/* xmit ring drop default setting */
> >  	airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
> > @@ -1051,13 +1051,17 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> >  
> >  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  {
> > -	struct airoha_eth *eth = q->qdma->eth;
> > -	int i;
> > +	struct airoha_qdma *qdma = q->qdma;
> > +	struct airoha_eth *eth = qdma->eth;
> > +	int i, qid = q - &qdma->q_tx[0];
> > +	struct airoha_queue_entry *e;
> > +	u16 index = 0;
> >  
> >  	spin_lock_bh(&q->lock);
> >  	for (i = 0; i < q->ndesc; i++) {
> > -		struct airoha_queue_entry *e = &q->entry[i];
> 
> super nit: In v2 e is always used within a block (here and in the hunk below).
>            So I would lean towards declaring e in the blocks where it is
> 	   used.
> 
> 	   No need to repost just for this!

I can fix it if I need to repost.

Regards,
Lorenzo

> 
> > +		struct airoha_qdma_desc *desc = &q->desc[i];
> >  
> > +		e = &q->entry[i];
> >  		if (!e->dma_addr)
> >  			continue;
> >  
> > @@ -1067,8 +1071,31 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> >  		e->dma_addr = 0;
> >  		e->skb = NULL;
> >  		list_add_tail(&e->list, &q->tx_list);
> > +
> > +		/* Reset DMA descriptor */
> > +		WRITE_ONCE(desc->ctrl, 0);
> > +		WRITE_ONCE(desc->addr, 0);
> > +		WRITE_ONCE(desc->data, 0);
> > +		WRITE_ONCE(desc->msg0, 0);
> > +		WRITE_ONCE(desc->msg1, 0);
> > +		WRITE_ONCE(desc->msg2, 0);
> > +
> >  		q->queued--;
> >  	}
> > +
> > +	if (!list_empty(&q->tx_list)) {
> > +		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> > +				     list);
> > +		index = e - q->entry;
> > +	}
> > +	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
> > +	 * empty.
> > +	 */
> > +	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
> > +			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
> > +	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
> > +			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
> > +
> >  	spin_unlock_bh(&q->lock);
> >  }
> >  
> > 
> > ---
> > base-commit: 2cd7e6971fc2787408ceef17906ea152791448cf
> > change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
> > 
> > Best regards,
> > -- 
> > Lorenzo Bianconi <lorenzo@kernel.org>
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH v5 12/12] coresight: etm3x: remove redundant call etm_enable_hw() with hotplug
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

The cpu_online_mask is set at the CPUHP_BRINGUP_CPU step.
In other words, if etm4_enable_sysfs() is called between
CPUHP_BRINGUP_CPU and CPUHP_AP_ARM_CORESIGHT_STARTING,
etm_enable_hw() may be invoked in etm_enable_sysfs_smp_call()
and then executed again in etm_starting_cpu().

To remove this redundant call, take the hotplug lock before executing
etm_enable_sysfs_smp_call().

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 .../coresight/coresight-etm3x-core.c          | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm3x-core.c b/drivers/hwtracing/coresight/coresight-etm3x-core.c
index 706a4c7a40df..8bf7ae276096 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-core.c
@@ -521,26 +521,26 @@ static int etm_enable_sysfs(struct coresight_device *csdev, struct coresight_pat
 	struct etm_enable_arg arg = { };
 	int ret;
 
+	arg.drvdata = drvdata;
+	arg.traceid = path->trace_id;
+
+	raw_spin_lock(&drvdata->spinlock);
+	arg.config = drvdata->config;
+	raw_spin_unlock(&drvdata->spinlock);
 
 	/*
 	 * Configure the ETM only if the CPU is online.  If it isn't online
 	 * hw configuration will take place on the local CPU during bring up.
 	 */
-	if (cpu_online(drvdata->cpu)) {
-		arg.drvdata = drvdata;
-		arg.traceid = path->trace_id;
-
-		raw_spin_lock(&drvdata->spinlock);
-		arg.config = drvdata->config;
-		raw_spin_unlock(&drvdata->spinlock);
-
+	cpus_read_lock();
 		ret = smp_call_function_single(drvdata->cpu,
 					       etm_enable_sysfs_smp_call, &arg, 1);
-		if (!ret)
-			ret = arg.rc;
-	} else {
+	cpus_read_unlock();
+
+	if (!ret)
+		ret = arg.rc;
+	else
 		ret = -ENODEV;
-	}
 
 	if (!ret)
 		dev_dbg(&csdev->dev, "ETM tracing enabled\n");
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 09/12] coresight: etm3x: change drvdata->spinlock type to raw_spin_lock_t
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

etm_starting_cpu()/etm_dying_cpu() are called in not sleepable context.
This poses an issue in PREEMPT_RT kernel where spinlock_t is sleepable.

To address this, change etm3's drvdata->spinlock type to raw_spin_lock_t.
This will be good to align with etm4x.

Reviewed-by: Jie Gan <jie.gan@oss.qualcomm.com>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm.h   |   2 +-
 .../coresight/coresight-etm3x-core.c          |  18 +--
 .../coresight/coresight-etm3x-sysfs.c         | 130 +++++++++---------
 3 files changed, 75 insertions(+), 75 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm.h b/drivers/hwtracing/coresight/coresight-etm.h
index 1d753cca2943..40f20daded4f 100644
--- a/drivers/hwtracing/coresight/coresight-etm.h
+++ b/drivers/hwtracing/coresight/coresight-etm.h
@@ -232,7 +232,7 @@ struct etm_drvdata {
 	struct csdev_access		csa;
 	struct clk			*atclk;
 	struct coresight_device		*csdev;
-	spinlock_t			spinlock;
+	raw_spinlock_t			spinlock;
 	int				cpu;
 	int				port_size;
 	u8				arch;
diff --git a/drivers/hwtracing/coresight/coresight-etm3x-core.c b/drivers/hwtracing/coresight/coresight-etm3x-core.c
index a547a6d2e0bd..4a702b515733 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-core.c
@@ -511,7 +511,7 @@ static int etm_enable_sysfs(struct coresight_device *csdev, struct coresight_pat
 	struct etm_enable_arg arg = { };
 	int ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 
 	drvdata->traceid = path->trace_id;
 
@@ -534,7 +534,7 @@ static int etm_enable_sysfs(struct coresight_device *csdev, struct coresight_pat
 	if (ret)
 		etm_release_trace_id(drvdata);
 
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	if (!ret)
 		dev_dbg(&csdev->dev, "ETM tracing enabled\n");
@@ -634,7 +634,7 @@ static void etm_disable_sysfs(struct coresight_device *csdev)
 	 * DYING hotplug callback is serviced by the ETM driver.
 	 */
 	cpus_read_lock();
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 
 	/*
 	 * Executing etm_disable_hw on the cpu whose ETM is being disabled
@@ -643,7 +643,7 @@ static void etm_disable_sysfs(struct coresight_device *csdev)
 	smp_call_function_single(drvdata->cpu, etm_disable_sysfs_smp_call,
 				 drvdata, 1);
 
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 	cpus_read_unlock();
 
 	/*
@@ -709,7 +709,7 @@ static int etm_starting_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	spin_lock(&etmdrvdata[cpu]->spinlock);
+	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
 	if (!etmdrvdata[cpu]->os_unlock) {
 		etm_os_unlock(etmdrvdata[cpu]);
 		etmdrvdata[cpu]->os_unlock = true;
@@ -717,7 +717,7 @@ static int etm_starting_cpu(unsigned int cpu)
 
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm_enable_hw(etmdrvdata[cpu]);
-	spin_unlock(&etmdrvdata[cpu]->spinlock);
+	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -726,10 +726,10 @@ static int etm_dying_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	spin_lock(&etmdrvdata[cpu]->spinlock);
+	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm_disable_hw(etmdrvdata[cpu]);
-	spin_unlock(&etmdrvdata[cpu]->spinlock);
+	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -856,7 +856,7 @@ static int etm_probe(struct amba_device *adev, const struct amba_id *id)
 
 	desc.access = drvdata->csa = CSDEV_ACCESS_IOMEM(base);
 
-	spin_lock_init(&drvdata->spinlock);
+	raw_spin_lock_init(&drvdata->spinlock);
 
 	drvdata->atclk = devm_clk_get_optional_enabled(dev, "atclk");
 	if (IS_ERR(drvdata->atclk))
diff --git a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
index 762109307b86..42b12c33516b 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
@@ -49,13 +49,13 @@ static ssize_t etmsr_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 
 	pm_runtime_get_sync(dev->parent);
-	spin_lock_irqsave(&drvdata->spinlock, flags);
+	raw_spin_lock_irqsave(&drvdata->spinlock, flags);
 	CS_UNLOCK(drvdata->csa.base);
 
 	val = etm_readl(drvdata, ETMSR);
 
 	CS_LOCK(drvdata->csa.base);
-	spin_unlock_irqrestore(&drvdata->spinlock, flags);
+	raw_spin_unlock_irqrestore(&drvdata->spinlock, flags);
 	pm_runtime_put(dev->parent);
 
 	return sprintf(buf, "%#lx\n", val);
@@ -76,7 +76,7 @@ static ssize_t reset_store(struct device *dev,
 		return ret;
 
 	if (val) {
-		spin_lock(&drvdata->spinlock);
+		raw_spin_lock(&drvdata->spinlock);
 		memset(config, 0, sizeof(struct etm_config));
 		config->mode = ETM_MODE_EXCLUDE;
 		config->trigger_event = ETM_DEFAULT_EVENT_VAL;
@@ -86,7 +86,7 @@ static ssize_t reset_store(struct device *dev,
 
 		etm_set_default(config);
 		etm_release_trace_id(drvdata);
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 	}
 
 	return size;
@@ -117,7 +117,7 @@ static ssize_t mode_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->mode = val & ETM_MODE_ALL;
 
 	if (config->mode & ETM_MODE_EXCLUDE)
@@ -168,12 +168,12 @@ static ssize_t mode_store(struct device *dev,
 	if (config->mode & (ETM_MODE_EXCL_KERN | ETM_MODE_EXCL_USER))
 		etm_config_trace_mode(config);
 
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 
 err_unlock:
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 	return ret;
 }
 static DEVICE_ATTR_RW(mode);
@@ -299,9 +299,9 @@ static ssize_t addr_idx_store(struct device *dev,
 	 * Use spinlock to ensure index doesn't change while it gets
 	 * dereferenced multiple times within a spinlock block elsewhere.
 	 */
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->addr_idx = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -315,16 +315,16 @@ static ssize_t addr_single_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_SINGLE)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EINVAL;
 	}
 
 	val = config->addr_val[idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -343,17 +343,17 @@ static ssize_t addr_single_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_SINGLE)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EINVAL;
 	}
 
 	config->addr_val[idx] = val;
 	config->addr_type[idx] = ETM_ADDR_TYPE_SINGLE;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -367,23 +367,23 @@ static ssize_t addr_range_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (idx % 2 != 0) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 	if (!((config->addr_type[idx] == ETM_ADDR_TYPE_NONE &&
 	       config->addr_type[idx + 1] == ETM_ADDR_TYPE_NONE) ||
 	      (config->addr_type[idx] == ETM_ADDR_TYPE_RANGE &&
 	       config->addr_type[idx + 1] == ETM_ADDR_TYPE_RANGE))) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
 	val1 = config->addr_val[idx];
 	val2 = config->addr_val[idx + 1];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx %#lx\n", val1, val2);
 }
@@ -403,17 +403,17 @@ static ssize_t addr_range_store(struct device *dev,
 	if (val1 > val2)
 		return -EINVAL;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (idx % 2 != 0) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 	if (!((config->addr_type[idx] == ETM_ADDR_TYPE_NONE &&
 	       config->addr_type[idx + 1] == ETM_ADDR_TYPE_NONE) ||
 	      (config->addr_type[idx] == ETM_ADDR_TYPE_RANGE &&
 	       config->addr_type[idx + 1] == ETM_ADDR_TYPE_RANGE))) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
@@ -422,7 +422,7 @@ static ssize_t addr_range_store(struct device *dev,
 	config->addr_val[idx + 1] = val2;
 	config->addr_type[idx + 1] = ETM_ADDR_TYPE_RANGE;
 	config->enable_ctrl1 |= (1 << (idx/2));
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -436,16 +436,16 @@ static ssize_t addr_start_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_START)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
 	val = config->addr_val[idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -464,11 +464,11 @@ static ssize_t addr_start_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_START)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
@@ -476,7 +476,7 @@ static ssize_t addr_start_store(struct device *dev,
 	config->addr_type[idx] = ETM_ADDR_TYPE_START;
 	config->startstop_ctrl |= (1 << idx);
 	config->enable_ctrl1 |= ETMTECR1_START_STOP;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -490,16 +490,16 @@ static ssize_t addr_stop_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_STOP)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
 	val = config->addr_val[idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -518,11 +518,11 @@ static ssize_t addr_stop_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
 	if (!(config->addr_type[idx] == ETM_ADDR_TYPE_NONE ||
 	      config->addr_type[idx] == ETM_ADDR_TYPE_STOP)) {
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return -EPERM;
 	}
 
@@ -530,7 +530,7 @@ static ssize_t addr_stop_store(struct device *dev,
 	config->addr_type[idx] = ETM_ADDR_TYPE_STOP;
 	config->startstop_ctrl |= (1 << (idx + 16));
 	config->enable_ctrl1 |= ETMTECR1_START_STOP;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -543,9 +543,9 @@ static ssize_t addr_acctype_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	val = config->addr_acctype[config->addr_idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -563,9 +563,9 @@ static ssize_t addr_acctype_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->addr_acctype[config->addr_idx] = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -601,9 +601,9 @@ static ssize_t cntr_idx_store(struct device *dev,
 	 * Use spinlock to ensure index doesn't change while it gets
 	 * dereferenced multiple times within a spinlock block elsewhere.
 	 */
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->cntr_idx = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -616,9 +616,9 @@ static ssize_t cntr_rld_val_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	val = config->cntr_rld_val[config->cntr_idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -636,9 +636,9 @@ static ssize_t cntr_rld_val_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->cntr_rld_val[config->cntr_idx] = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -651,9 +651,9 @@ static ssize_t cntr_event_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	val = config->cntr_event[config->cntr_idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -671,9 +671,9 @@ static ssize_t cntr_event_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->cntr_event[config->cntr_idx] = val & ETM_EVENT_MASK;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -686,9 +686,9 @@ static ssize_t cntr_rld_event_show(struct device *dev,
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etm_config *config = &drvdata->config;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	val = config->cntr_rld_event[config->cntr_idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -706,9 +706,9 @@ static ssize_t cntr_rld_event_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->cntr_rld_event[config->cntr_idx] = val & ETM_EVENT_MASK;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -723,11 +723,11 @@ static ssize_t cntr_val_show(struct device *dev,
 	struct etm_config *config = &drvdata->config;
 
 	if (!coresight_get_mode(drvdata->csdev)) {
-		spin_lock(&drvdata->spinlock);
+		raw_spin_lock(&drvdata->spinlock);
 		for (i = 0; i < drvdata->nr_cntr; i++)
 			ret += sprintf(buf, "counter %d: %x\n",
 				       i, config->cntr_val[i]);
-		spin_unlock(&drvdata->spinlock);
+		raw_spin_unlock(&drvdata->spinlock);
 		return ret;
 	}
 
@@ -752,9 +752,9 @@ static ssize_t cntr_val_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->cntr_val[config->cntr_idx] = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -947,13 +947,13 @@ static ssize_t seq_curr_state_show(struct device *dev,
 	}
 
 	pm_runtime_get_sync(dev->parent);
-	spin_lock_irqsave(&drvdata->spinlock, flags);
+	raw_spin_lock_irqsave(&drvdata->spinlock, flags);
 
 	CS_UNLOCK(drvdata->csa.base);
 	val = (etm_readl(drvdata, ETMSQR) & ETM_SQR_MASK);
 	CS_LOCK(drvdata->csa.base);
 
-	spin_unlock_irqrestore(&drvdata->spinlock, flags);
+	raw_spin_unlock_irqrestore(&drvdata->spinlock, flags);
 	pm_runtime_put(dev->parent);
 out:
 	return sprintf(buf, "%#lx\n", val);
@@ -1012,9 +1012,9 @@ static ssize_t ctxid_idx_store(struct device *dev,
 	 * Use spinlock to ensure index doesn't change while it gets
 	 * dereferenced multiple times within a spinlock block elsewhere.
 	 */
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->ctxid_idx = val;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
@@ -1034,9 +1034,9 @@ static ssize_t ctxid_pid_show(struct device *dev,
 	if (task_active_pid_ns(current) != &init_pid_ns)
 		return -EINVAL;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	val = config->ctxid_pid[config->ctxid_idx];
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return sprintf(buf, "%#lx\n", val);
 }
@@ -1066,9 +1066,9 @@ static ssize_t ctxid_pid_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	spin_lock(&drvdata->spinlock);
+	raw_spin_lock(&drvdata->spinlock);
 	config->ctxid_pid[config->ctxid_idx] = pid;
-	spin_unlock(&drvdata->spinlock);
+	raw_spin_unlock(&drvdata->spinlock);
 
 	return size;
 }
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 10/12] coresight: etm3x: introduce struct etm_caps
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

Introduce struct etm_caps to describe ETMv3 capabilities
and move capabilities information into it.

Since drvdata->etmccr and drvdata->etmccer are used to check
whether it supports fifofull logic and timestamping,
remove etmccr and etmccer field from drvdata and add relevant fields
in etm_caps structure.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm.h   | 42 ++++++++++++-------
 .../coresight/coresight-etm3x-core.c          | 39 ++++++++++-------
 .../coresight/coresight-etm3x-sysfs.c         | 29 ++++++++-----
 3 files changed, 67 insertions(+), 43 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm.h b/drivers/hwtracing/coresight/coresight-etm.h
index 40f20daded4f..932bec82fb47 100644
--- a/drivers/hwtracing/coresight/coresight-etm.h
+++ b/drivers/hwtracing/coresight/coresight-etm.h
@@ -140,6 +140,30 @@
 				 ETM_ADD_COMP_0		|	\
 				 ETM_EVENT_NOT_A)
 
+/**
+ * struct etm_caps - specifics ETM capabilities
+ * @port_size:	port size as reported by ETMCR bit 4-6 and 21.
+ * @nr_addr_cmp:Number of pairs of address comparators as found in ETMCCR.
+ * @nr_cntr:	Number of counters as found in ETMCCR bit 13-15.
+ * @nr_ext_inp:	Number of external input as found in ETMCCR bit 17-19.
+ * @nr_ext_out:	Number of external output as found in ETMCCR bit 20-22.
+ * @nr_ctxid_cmp: Number of contextID comparators as found in ETMCCR bit 24-25.
+ * @fifofull:	FIFOFULL logic is present.
+ * @timestamp:	Timestamping is implemented.
+ * @retstack:	Return stack is implemented.
+ */
+struct etm_caps {
+	int	port_size;
+	u8	nr_addr_cmp;
+	u8	nr_cntr;
+	u8	nr_ext_inp;
+	u8	nr_ext_out;
+	u8	nr_ctxid_cmp;
+	bool	fifofull : 1;
+	bool	timestamp : 1;
+	bool	retstack : 1;
+};
+
 /**
  * struct etm_config - configuration information related to an ETM
  * @mode:	controls various modes supported by this ETM/PTM.
@@ -212,19 +236,12 @@ struct etm_config {
  * @csdev:	component vitals needed by the framework.
  * @spinlock:	only one at a time pls.
  * @cpu:	the cpu this component is affined to.
- * @port_size:	port size as reported by ETMCR bit 4-6 and 21.
  * @arch:	ETM/PTM version number.
+ * @caps:	ETM capabilities.
  * @use_cpu14:	true if management registers need to be accessed via CP14.
  * @sticky_enable: true if ETM base configuration has been done.
  * @boot_enable:true if we should start tracing at boot time.
  * @os_unlock:	true if access to management registers is allowed.
- * @nr_addr_cmp:Number of pairs of address comparators as found in ETMCCR.
- * @nr_cntr:	Number of counters as found in ETMCCR bit 13-15.
- * @nr_ext_inp:	Number of external input as found in ETMCCR bit 17-19.
- * @nr_ext_out:	Number of external output as found in ETMCCR bit 20-22.
- * @nr_ctxid_cmp: Number of contextID comparators as found in ETMCCR bit 24-25.
- * @etmccr:	value of register ETMCCR.
- * @etmccer:	value of register ETMCCER.
  * @traceid:	value of the current ID for this component.
  * @config:	structure holding configuration parameters.
  */
@@ -234,19 +251,12 @@ struct etm_drvdata {
 	struct coresight_device		*csdev;
 	raw_spinlock_t			spinlock;
 	int				cpu;
-	int				port_size;
 	u8				arch;
+	struct etm_caps			caps;
 	bool				use_cp14;
 	bool				sticky_enable;
 	bool				boot_enable;
 	bool				os_unlock;
-	u8				nr_addr_cmp;
-	u8				nr_cntr;
-	u8				nr_ext_inp;
-	u8				nr_ext_out;
-	u8				nr_ctxid_cmp;
-	u32				etmccr;
-	u32				etmccer;
 	u32				traceid;
 	struct etm_config		config;
 };
diff --git a/drivers/hwtracing/coresight/coresight-etm3x-core.c b/drivers/hwtracing/coresight/coresight-etm3x-core.c
index 4a702b515733..e42ca346da91 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-core.c
@@ -308,6 +308,7 @@ void etm_config_trace_mode(struct etm_config *config)
 static int etm_parse_event_config(struct etm_drvdata *drvdata,
 				  struct perf_event *event)
 {
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 	struct perf_event_attr *attr = &event->attr;
 	u8 ts_level;
@@ -356,8 +357,7 @@ static int etm_parse_event_config(struct etm_drvdata *drvdata,
 	 * has ret stack) on the same SoC. So only enable when it can be honored
 	 * - trace will still continue normally otherwise.
 	 */
-	if (ATTR_CFG_GET_FLD(attr, retstack) &&
-	    (drvdata->etmccer & ETMCCER_RETSTACK))
+	if (ATTR_CFG_GET_FLD(attr, retstack) && (caps->retstack))
 		config->ctrl |= ETMCR_RETURN_STACK;
 
 	return 0;
@@ -367,6 +367,7 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 {
 	int i, rc;
 	u32 etmcr;
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 	struct coresight_device *csdev = drvdata->csdev;
 
@@ -388,7 +389,7 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 	etmcr = etm_readl(drvdata, ETMCR);
 	/* Clear setting from a previous run if need be */
 	etmcr &= ~ETM3X_SUPPORTED_OPTIONS;
-	etmcr |= drvdata->port_size;
+	etmcr |= caps->port_size;
 	etmcr |= ETMCR_ETM_EN;
 	etm_writel(drvdata, config->ctrl | etmcr, ETMCR);
 	etm_writel(drvdata, config->trigger_event, ETMTRIGGER);
@@ -396,11 +397,11 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 	etm_writel(drvdata, config->enable_event, ETMTEEVR);
 	etm_writel(drvdata, config->enable_ctrl1, ETMTECR1);
 	etm_writel(drvdata, config->fifofull_level, ETMFFLR);
-	for (i = 0; i < drvdata->nr_addr_cmp; i++) {
+	for (i = 0; i < caps->nr_addr_cmp; i++) {
 		etm_writel(drvdata, config->addr_val[i], ETMACVRn(i));
 		etm_writel(drvdata, config->addr_acctype[i], ETMACTRn(i));
 	}
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		etm_writel(drvdata, config->cntr_rld_val[i], ETMCNTRLDVRn(i));
 		etm_writel(drvdata, config->cntr_event[i], ETMCNTENRn(i));
 		etm_writel(drvdata, config->cntr_rld_event[i],
@@ -414,9 +415,9 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 	etm_writel(drvdata, config->seq_32_event, ETMSQ32EVR);
 	etm_writel(drvdata, config->seq_13_event, ETMSQ13EVR);
 	etm_writel(drvdata, config->seq_curr_state, ETMSQR);
-	for (i = 0; i < drvdata->nr_ext_out; i++)
+	for (i = 0; i < caps->nr_ext_out; i++)
 		etm_writel(drvdata, ETM_DEFAULT_EVENT_VAL, ETMEXTOUTEVRn(i));
-	for (i = 0; i < drvdata->nr_ctxid_cmp; i++)
+	for (i = 0; i < caps->nr_ctxid_cmp; i++)
 		etm_writel(drvdata, config->ctxid_pid[i], ETMCIDCVRn(i));
 	etm_writel(drvdata, config->ctxid_mask, ETMCIDCMR);
 	etm_writel(drvdata, config->sync_freq, ETMSYNCFR);
@@ -563,6 +564,7 @@ static int etm_enable(struct coresight_device *csdev, struct perf_event *event,
 static void etm_disable_hw(struct etm_drvdata *drvdata)
 {
 	int i;
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 	struct coresight_device *csdev = drvdata->csdev;
 
@@ -572,7 +574,7 @@ static void etm_disable_hw(struct etm_drvdata *drvdata)
 	/* Read back sequencer and counters for post trace analysis */
 	config->seq_curr_state = (etm_readl(drvdata, ETMSQR) & ETM_SQR_MASK);
 
-	for (i = 0; i < drvdata->nr_cntr; i++)
+	for (i = 0; i < caps->nr_cntr; i++)
 		config->cntr_val[i] = etm_readl(drvdata, ETMCNTVRn(i));
 
 	etm_set_pwrdwn(drvdata);
@@ -754,7 +756,9 @@ static void etm_init_arch_data(void *info)
 {
 	u32 etmidr;
 	u32 etmccr;
+	u32 etmccer;
 	struct etm_drvdata *drvdata = info;
+	struct etm_caps *caps = &drvdata->caps;
 
 	/* Make sure all registers are accessible */
 	etm_os_unlock(drvdata);
@@ -779,16 +783,19 @@ static void etm_init_arch_data(void *info)
 	/* Find all capabilities */
 	etmidr = etm_readl(drvdata, ETMIDR);
 	drvdata->arch = BMVAL(etmidr, 4, 11);
-	drvdata->port_size = etm_readl(drvdata, ETMCR) & PORT_SIZE_MASK;
+	caps->port_size = etm_readl(drvdata, ETMCR) & PORT_SIZE_MASK;
+
+	etmccer = etm_readl(drvdata, ETMCCER);
+	caps->timestamp = !!(etmccer & ETMCCER_TIMESTAMP);
+	caps->retstack = !!(etmccer & ETMCCER_RETSTACK);
 
-	drvdata->etmccer = etm_readl(drvdata, ETMCCER);
 	etmccr = etm_readl(drvdata, ETMCCR);
-	drvdata->etmccr = etmccr;
-	drvdata->nr_addr_cmp = BMVAL(etmccr, 0, 3) * 2;
-	drvdata->nr_cntr = BMVAL(etmccr, 13, 15);
-	drvdata->nr_ext_inp = BMVAL(etmccr, 17, 19);
-	drvdata->nr_ext_out = BMVAL(etmccr, 20, 22);
-	drvdata->nr_ctxid_cmp = BMVAL(etmccr, 24, 25);
+	caps->fifofull = !!(etmccr & ETMCCR_FIFOFULL);
+	caps->nr_addr_cmp = BMVAL(etmccr, 0, 3) * 2;
+	caps->nr_cntr = BMVAL(etmccr, 13, 15);
+	caps->nr_ext_inp = BMVAL(etmccr, 17, 19);
+	caps->nr_ext_out = BMVAL(etmccr, 20, 22);
+	caps->nr_ctxid_cmp = BMVAL(etmccr, 24, 25);
 
 	coresight_clear_self_claim_tag_unlocked(&drvdata->csa);
 	etm_set_pwrdwn(drvdata);
diff --git a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
index 42b12c33516b..f7330d830e21 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-sysfs.c
@@ -15,8 +15,9 @@ static ssize_t nr_addr_cmp_show(struct device *dev,
 {
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_addr_cmp;
+	val = caps->nr_addr_cmp;
 	return sprintf(buf, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_addr_cmp);
@@ -25,8 +26,9 @@ static ssize_t nr_cntr_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_cntr;
+	val = caps->nr_cntr;
 	return sprintf(buf, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_cntr);
@@ -37,7 +39,7 @@ static ssize_t nr_ctxid_cmp_show(struct device *dev,
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
 
-	val = drvdata->nr_ctxid_cmp;
+	val = drvdata->caps.nr_ctxid_cmp;
 	return sprintf(buf, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_ctxid_cmp);
@@ -80,7 +82,7 @@ static ssize_t reset_store(struct device *dev,
 		memset(config, 0, sizeof(struct etm_config));
 		config->mode = ETM_MODE_EXCLUDE;
 		config->trigger_event = ETM_DEFAULT_EVENT_VAL;
-		for (i = 0; i < drvdata->nr_addr_cmp; i++) {
+		for (i = 0; i < drvdata->caps.nr_addr_cmp; i++) {
 			config->addr_type[i] = ETM_ADDR_TYPE_NONE;
 		}
 
@@ -111,6 +113,7 @@ static ssize_t mode_store(struct device *dev,
 	int ret;
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 
 	ret = kstrtoul(buf, 16, &val);
@@ -131,7 +134,7 @@ static ssize_t mode_store(struct device *dev,
 		config->ctrl &= ~ETMCR_CYC_ACC;
 
 	if (config->mode & ETM_MODE_STALL) {
-		if (!(drvdata->etmccr & ETMCCR_FIFOFULL)) {
+		if (!caps->fifofull) {
 			dev_warn(dev, "stall mode not supported\n");
 			ret = -EINVAL;
 			goto err_unlock;
@@ -141,7 +144,7 @@ static ssize_t mode_store(struct device *dev,
 		config->ctrl &= ~ETMCR_STALL_MODE;
 
 	if (config->mode & ETM_MODE_TIMESTAMP) {
-		if (!(drvdata->etmccer & ETMCCER_TIMESTAMP)) {
+		if (!caps->timestamp) {
 			dev_warn(dev, "timestamp not supported\n");
 			ret = -EINVAL;
 			goto err_unlock;
@@ -286,13 +289,14 @@ static ssize_t addr_idx_store(struct device *dev,
 	int ret;
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 
 	ret = kstrtoul(buf, 16, &val);
 	if (ret)
 		return ret;
 
-	if (val >= drvdata->nr_addr_cmp)
+	if (val >= caps->nr_addr_cmp)
 		return -EINVAL;
 
 	/*
@@ -589,13 +593,14 @@ static ssize_t cntr_idx_store(struct device *dev,
 	int ret;
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 
 	ret = kstrtoul(buf, 16, &val);
 	if (ret)
 		return ret;
 
-	if (val >= drvdata->nr_cntr)
+	if (val >= caps->nr_cntr)
 		return -EINVAL;
 	/*
 	 * Use spinlock to ensure index doesn't change while it gets
@@ -720,18 +725,19 @@ static ssize_t cntr_val_show(struct device *dev,
 	int i, ret = 0;
 	u32 val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 
 	if (!coresight_get_mode(drvdata->csdev)) {
 		raw_spin_lock(&drvdata->spinlock);
-		for (i = 0; i < drvdata->nr_cntr; i++)
+		for (i = 0; i < caps->nr_cntr; i++)
 			ret += sprintf(buf, "counter %d: %x\n",
 				       i, config->cntr_val[i]);
 		raw_spin_unlock(&drvdata->spinlock);
 		return ret;
 	}
 
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		val = etm_readl(drvdata, ETMCNTVRn(i));
 		ret += sprintf(buf, "counter %d: %x\n", i, val);
 	}
@@ -999,13 +1005,14 @@ static ssize_t ctxid_idx_store(struct device *dev,
 	int ret;
 	unsigned long val;
 	struct etm_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etm_caps *caps = &drvdata->caps;
 	struct etm_config *config = &drvdata->config;
 
 	ret = kstrtoul(buf, 16, &val);
 	if (ret)
 		return ret;
 
-	if (val >= drvdata->nr_ctxid_cmp)
+	if (val >= caps->nr_ctxid_cmp)
 		return -EINVAL;
 
 	/*
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 11/12] coresight: etm3x: fix inconsistencies with sysfs configuration
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

The current ETM3x configuration via sysfs can lead to the following
inconsistencies:

  - If a configuration is modified via sysfs while a perf session is
    active, the running configuration may differ between before
    a sched-out and after a subsequent sched-in.

To resolve these issues, separate the configuration into:

  - active_config: the configuration applied to the current session
  - config: the configuration set via sysfs

Additionally:

  - Since active_config and related fields are accessed only by the local CPU
    in etm_enable/disable_sysfs_smp_call() (similar to perf enable/disable),
    remove the lock/unlock from the sysfs enable/disable path and
    starting/dying_cpu path except when to access config fields only.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm.h   |  2 +
 .../coresight/coresight-etm3x-core.c          | 49 ++++++++++---------
 2 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm.h b/drivers/hwtracing/coresight/coresight-etm.h
index 932bec82fb47..01f1a7f2559c 100644
--- a/drivers/hwtracing/coresight/coresight-etm.h
+++ b/drivers/hwtracing/coresight/coresight-etm.h
@@ -243,6 +243,7 @@ struct etm_config {
  * @boot_enable:true if we should start tracing at boot time.
  * @os_unlock:	true if access to management registers is allowed.
  * @traceid:	value of the current ID for this component.
+ * @active_config:	structure holding current running configuration parameters.
  * @config:	structure holding configuration parameters.
  */
 struct etm_drvdata {
@@ -258,6 +259,7 @@ struct etm_drvdata {
 	bool				boot_enable;
 	bool				os_unlock;
 	u32				traceid;
+	struct etm_config		active_config;
 	struct etm_config		config;
 };
 
diff --git a/drivers/hwtracing/coresight/coresight-etm3x-core.c b/drivers/hwtracing/coresight/coresight-etm3x-core.c
index e42ca346da91..706a4c7a40df 100644
--- a/drivers/hwtracing/coresight/coresight-etm3x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm3x-core.c
@@ -309,7 +309,7 @@ static int etm_parse_event_config(struct etm_drvdata *drvdata,
 				  struct perf_event *event)
 {
 	const struct etm_caps *caps = &drvdata->caps;
-	struct etm_config *config = &drvdata->config;
+	struct etm_config *config = &drvdata->active_config;
 	struct perf_event_attr *attr = &event->attr;
 	u8 ts_level;
 
@@ -368,7 +368,7 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 	int i, rc;
 	u32 etmcr;
 	const struct etm_caps *caps = &drvdata->caps;
-	struct etm_config *config = &drvdata->config;
+	struct etm_config *config = &drvdata->active_config;
 	struct coresight_device *csdev = drvdata->csdev;
 
 	CS_UNLOCK(drvdata->csa.base);
@@ -442,29 +442,38 @@ static int etm_enable_hw(struct etm_drvdata *drvdata)
 
 struct etm_enable_arg {
 	struct etm_drvdata *drvdata;
+	u32 traceid;
+	struct etm_config config;
 	int rc;
 };
 
 static void etm_enable_sysfs_smp_call(void *info)
 {
 	struct etm_enable_arg *arg = info;
+	struct etm_drvdata *drvdata;
 	struct coresight_device *csdev;
 
 	if (WARN_ON(!arg))
 		return;
 
-	csdev = arg->drvdata->csdev;
+	drvdata = arg->drvdata;
+	csdev = drvdata->csdev;
 	if (!coresight_take_mode(csdev, CS_MODE_SYSFS)) {
 		/* Someone is already using the tracer */
 		arg->rc = -EBUSY;
 		return;
 	}
 
+	drvdata->active_config = arg->config;
+	drvdata->traceid = arg->traceid;
+
 	arg->rc = etm_enable_hw(arg->drvdata);
 
 	/* The tracer didn't start */
 	if (arg->rc)
 		coresight_set_mode(csdev, CS_MODE_DISABLED);
+	else
+		drvdata->sticky_enable = true;
 }
 
 static int etm_cpu_id(struct coresight_device *csdev)
@@ -512,9 +521,6 @@ static int etm_enable_sysfs(struct coresight_device *csdev, struct coresight_pat
 	struct etm_enable_arg arg = { };
 	int ret;
 
-	raw_spin_lock(&drvdata->spinlock);
-
-	drvdata->traceid = path->trace_id;
 
 	/*
 	 * Configure the ETM only if the CPU is online.  If it isn't online
@@ -522,23 +528,25 @@ static int etm_enable_sysfs(struct coresight_device *csdev, struct coresight_pat
 	 */
 	if (cpu_online(drvdata->cpu)) {
 		arg.drvdata = drvdata;
+		arg.traceid = path->trace_id;
+
+		raw_spin_lock(&drvdata->spinlock);
+		arg.config = drvdata->config;
+		raw_spin_unlock(&drvdata->spinlock);
+
 		ret = smp_call_function_single(drvdata->cpu,
 					       etm_enable_sysfs_smp_call, &arg, 1);
 		if (!ret)
 			ret = arg.rc;
-		if (!ret)
-			drvdata->sticky_enable = true;
 	} else {
 		ret = -ENODEV;
 	}
 
-	if (ret)
-		etm_release_trace_id(drvdata);
-
-	raw_spin_unlock(&drvdata->spinlock);
-
 	if (!ret)
 		dev_dbg(&csdev->dev, "ETM tracing enabled\n");
+	else
+		etm_release_trace_id(drvdata);
+
 	return ret;
 }
 
@@ -565,7 +573,7 @@ static void etm_disable_hw(struct etm_drvdata *drvdata)
 {
 	int i;
 	const struct etm_caps *caps = &drvdata->caps;
-	struct etm_config *config = &drvdata->config;
+	struct etm_config *config = &drvdata->active_config;
 	struct coresight_device *csdev = drvdata->csdev;
 
 	CS_UNLOCK(drvdata->csa.base);
@@ -636,7 +644,6 @@ static void etm_disable_sysfs(struct coresight_device *csdev)
 	 * DYING hotplug callback is serviced by the ETM driver.
 	 */
 	cpus_read_lock();
-	raw_spin_lock(&drvdata->spinlock);
 
 	/*
 	 * Executing etm_disable_hw on the cpu whose ETM is being disabled
@@ -645,7 +652,6 @@ static void etm_disable_sysfs(struct coresight_device *csdev)
 	smp_call_function_single(drvdata->cpu, etm_disable_sysfs_smp_call,
 				 drvdata, 1);
 
-	raw_spin_unlock(&drvdata->spinlock);
 	cpus_read_unlock();
 
 	/*
@@ -711,15 +717,11 @@ static int etm_starting_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
-	if (!etmdrvdata[cpu]->os_unlock) {
+	if (!etmdrvdata[cpu]->os_unlock)
 		etm_os_unlock(etmdrvdata[cpu]);
-		etmdrvdata[cpu]->os_unlock = true;
-	}
 
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm_enable_hw(etmdrvdata[cpu]);
-	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -728,10 +730,8 @@ static int etm_dying_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm_disable_hw(etmdrvdata[cpu]);
-	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -884,7 +884,8 @@ static int etm_probe(struct amba_device *adev, const struct amba_id *id)
 	if (etm_arch_supported(drvdata->arch) == false)
 		return -EINVAL;
 
-	etm_set_default(&drvdata->config);
+	etm_set_default(&drvdata->active_config);
+	drvdata->config = drvdata->active_config;
 
 	pdata = coresight_get_platform_data(dev);
 	if (IS_ERR(pdata))
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 08/12] coresight: etm4x: remove redundant call etm4_enable_hw() with hotplug
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

The cpu_online_mask is set at the CPUHP_BRINGUP_CPU step.
In other words, if etm4_enable_sysfs() is called between
CPUHP_BRINGUP_CPU and CPUHP_AP_ARM_CORESIGHT_STARTING,
etm4_enable_hw() may be invoked in etm4_enable_sysfs_smp_call()
and then executed again in etm4_starting_cpu().

To remove this redundant call, take the hotplug lock before executing
etm4_enable_sysfs_smp_call().

Reviewed-by: Jie Gan <jie.gan@oss.qualcomm.com>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm4x-core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index 15aaf4a898e1..2ce6b08f7f54 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -960,8 +960,20 @@ static int etm4_enable_sysfs(struct coresight_device *csdev, struct coresight_pa
 	arg.config = drvdata->config;
 	raw_spin_unlock(&drvdata->spinlock);
 
+	/*
+	 * Take the hotplug lock to prevent redundant calls to etm4_enable_hw().
+	 *
+	 * The cpu_online_mask is set at the CPUHP_BRINGUP_CPU step.
+	 * In other words, if etm4_enable_sysfs() is called between
+	 * CPUHP_BRINGUP_CPU and CPUHP_AP_ARM_CORESIGHT_STARTING,
+	 * etm4_enable_hw() may be invoked in etm4_enable_sysfs_smp_call()
+	 * and then executed again in etm4_starting_cpu().
+	 */
+	cpus_read_lock();
 	ret = smp_call_function_single(drvdata->cpu,
 				       etm4_enable_sysfs_smp_call, &arg, 1);
+	cpus_read_unlock();
+
 	if (!ret)
 		ret = arg.rc;
 	if (!ret)
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 07/12] coresight: etm4x: fix inconsistencies with sysfs configuration
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

The current ETM4x configuration via sysfs can lead to
several inconsistencies:

  - If the configuration is modified via sysfs while a perf session is
    active, the running configuration may differ before a sched-out and
    after a subsequent sched-in.

  - If a perf session and a sysfs session enable tracing concurrently,
    the configuration from configfs may become corrupted.

  - There is a risk of corrupting drvdata->config if a perf session enables
    tracing while cscfg_csdev_disable_active_config() is being handled in
    etm4_disable_sysfs().

To resolve these issues, separate the configuration into:

  - active_config: the configuration applied to the current session
  - config: the configuration set via sysfs

Additionally:

  - Apply the configuration from configfs after taking the appropriate mode.

  - Since active_config and related fields are accessed only by the local CPU
    in etm4_enable/disable_sysfs_smp_call() (similar to perf enable/disable),
    remove the lock/unlock from the sysfs enable/disable path and
    startup/dying_cpu except when to access config fields.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 .../hwtracing/coresight/coresight-etm4x-cfg.c |   2 +-
 .../coresight/coresight-etm4x-core.c          | 107 ++++++++++--------
 drivers/hwtracing/coresight/coresight-etm4x.h |   2 +
 3 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-cfg.c b/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
index d14d7c8a23e5..0553771d04e7 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
@@ -47,7 +47,7 @@ static int etm4_cfg_map_reg_offset(struct etmv4_drvdata *drvdata,
 				   struct cscfg_regval_csdev *reg_csdev, u32 offset)
 {
 	int err = -EINVAL, idx;
-	struct etmv4_config *drvcfg = &drvdata->config;
+	struct etmv4_config *drvcfg = &drvdata->active_config;
 	u32 off_mask;
 
 	if (((offset >= TRCEVENTCTL0R) && (offset <= TRCVIPCSSCTLR)) ||
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index b199aebbdb60..15aaf4a898e1 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -245,6 +245,10 @@ void etm4_release_trace_id(struct etmv4_drvdata *drvdata)
 
 struct etm4_enable_arg {
 	struct etmv4_drvdata *drvdata;
+	unsigned long cfg_hash;
+	int preset;
+	u8 trace_id;
+	struct etmv4_config config;
 	int rc;
 };
 
@@ -270,10 +274,11 @@ static void etm4x_prohibit_trace(struct etmv4_drvdata *drvdata)
 static u64 etm4x_get_kern_user_filter(struct etmv4_drvdata *drvdata)
 {
 	u64 trfcr = drvdata->trfcr;
+	struct etmv4_config *config = &drvdata->active_config;
 
-	if (drvdata->config.mode & ETM_MODE_EXCL_KERN)
+	if (config->mode & ETM_MODE_EXCL_KERN)
 		trfcr &= ~TRFCR_EL1_ExTRE;
-	if (drvdata->config.mode & ETM_MODE_EXCL_USER)
+	if (config->mode & ETM_MODE_EXCL_USER)
 		trfcr &= ~TRFCR_EL1_E0TRE;
 
 	return trfcr;
@@ -281,7 +286,7 @@ static u64 etm4x_get_kern_user_filter(struct etmv4_drvdata *drvdata)
 
 /*
  * etm4x_allow_trace - Allow CPU tracing in the respective ELs,
- * as configured by the drvdata->config.mode for the current
+ * as configured by the drvdata->active_config.mode for the current
  * session. Even though we have TRCVICTLR bits to filter the
  * trace in the ELs, it doesn't prevent the ETM from generating
  * a packet (e.g, TraceInfo) that might contain the addresses from
@@ -292,12 +297,13 @@ static u64 etm4x_get_kern_user_filter(struct etmv4_drvdata *drvdata)
 static void etm4x_allow_trace(struct etmv4_drvdata *drvdata)
 {
 	u64 trfcr, guest_trfcr;
+	struct etmv4_config *config = &drvdata->active_config;
 
 	/* If the CPU doesn't support FEAT_TRF, nothing to do */
 	if (!drvdata->trfcr)
 		return;
 
-	if (drvdata->config.mode & ETM_MODE_EXCL_HOST)
+	if (config->mode & ETM_MODE_EXCL_HOST)
 		trfcr = drvdata->trfcr & ~(TRFCR_EL1_ExTRE | TRFCR_EL1_E0TRE);
 	else
 		trfcr = etm4x_get_kern_user_filter(drvdata);
@@ -305,7 +311,7 @@ static void etm4x_allow_trace(struct etmv4_drvdata *drvdata)
 	write_trfcr(trfcr);
 
 	/* Set filters for guests and pass to KVM */
-	if (drvdata->config.mode & ETM_MODE_EXCL_GUEST)
+	if (config->mode & ETM_MODE_EXCL_GUEST)
 		guest_trfcr = drvdata->trfcr & ~(TRFCR_EL1_ExTRE | TRFCR_EL1_E0TRE);
 	else
 		guest_trfcr = etm4x_get_kern_user_filter(drvdata);
@@ -499,7 +505,7 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 {
 	int i, rc;
 	const struct etmv4_caps *caps = &drvdata->caps;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 	struct coresight_device *csdev = drvdata->csdev;
 	struct device *etm_dev = &csdev->dev;
 	struct csdev_access *csa = &csdev->access;
@@ -618,23 +624,45 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 static void etm4_enable_sysfs_smp_call(void *info)
 {
 	struct etm4_enable_arg *arg = info;
+	struct etmv4_drvdata *drvdata;
 	struct coresight_device *csdev;
 
 	if (WARN_ON(!arg))
 		return;
 
-	csdev = arg->drvdata->csdev;
+	drvdata = arg->drvdata;
+	csdev = drvdata->csdev;
 	if (!coresight_take_mode(csdev, CS_MODE_SYSFS)) {
 		/* Someone is already using the tracer */
 		arg->rc = -EBUSY;
 		return;
 	}
 
-	arg->rc = etm4_enable_hw(arg->drvdata);
+	drvdata->active_config = arg->config;
 
-	/* The tracer didn't start */
+	if (arg->cfg_hash) {
+		arg->rc = cscfg_csdev_enable_active_config(csdev,
+							   arg->cfg_hash,
+							   arg->preset);
+		if (arg->rc)
+			goto err;
+	}
+
+	drvdata->trcid = arg->trace_id;
+
+	/* Tracer will never be paused in sysfs mode */
+	drvdata->paused = false;
+
+	arg->rc = etm4_enable_hw(drvdata);
 	if (arg->rc)
-		coresight_set_mode(csdev, CS_MODE_DISABLED);
+		goto err;
+
+	drvdata->sticky_enable = true;
+
+	return;
+err:
+	/* The tracer didn't start */
+	coresight_set_mode(csdev, CS_MODE_DISABLED);
 }
 
 /*
@@ -672,7 +700,7 @@ static int etm4_config_timestamp_event(struct etmv4_drvdata *drvdata,
 	int ctridx;
 	int rselector;
 	const struct etmv4_caps *caps = &drvdata->caps;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 
 	/* No point in trying if we don't have at least one counter */
 	if (!caps->nr_cntr)
@@ -756,7 +784,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
 	int ret = 0;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent);
 	const struct etmv4_caps *caps = &drvdata->caps;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 	struct perf_event_attr max_timestamp = {
 		.ATTR_CFG_FLD_timestamp_CFG = U64_MAX,
 	};
@@ -918,40 +946,29 @@ static int etm4_enable_sysfs(struct coresight_device *csdev, struct coresight_pa
 
 	/* enable any config activated by configfs */
 	cscfg_config_sysfs_get_active_cfg(&cfg_hash, &preset);
-	if (cfg_hash) {
-		ret = cscfg_csdev_enable_active_config(csdev, cfg_hash, preset);
-		if (ret) {
-			etm4_release_trace_id(drvdata);
-			return ret;
-		}
-	}
-
-	raw_spin_lock(&drvdata->spinlock);
-
-	drvdata->trcid = path->trace_id;
-
-	/* Tracer will never be paused in sysfs mode */
-	drvdata->paused = false;
 
 	/*
 	 * Executing etm4_enable_hw on the cpu whose ETM is being enabled
 	 * ensures that register writes occur when cpu is powered.
 	 */
 	arg.drvdata = drvdata;
+	arg.cfg_hash = cfg_hash;
+	arg.preset = preset;
+	arg.trace_id = path->trace_id;
+
+	raw_spin_lock(&drvdata->spinlock);
+	arg.config = drvdata->config;
+	raw_spin_unlock(&drvdata->spinlock);
+
 	ret = smp_call_function_single(drvdata->cpu,
 				       etm4_enable_sysfs_smp_call, &arg, 1);
 	if (!ret)
 		ret = arg.rc;
 	if (!ret)
-		drvdata->sticky_enable = true;
-
-	if (ret)
+		dev_dbg(&csdev->dev, "ETM tracing enabled\n");
+	else
 		etm4_release_trace_id(drvdata);
 
-	raw_spin_unlock(&drvdata->spinlock);
-
-	if (!ret)
-		dev_dbg(&csdev->dev, "ETM tracing enabled\n");
 	return ret;
 }
 
@@ -1038,7 +1055,7 @@ static void etm4_disable_hw(struct etmv4_drvdata *drvdata)
 {
 	u32 control;
 	const struct etmv4_caps *caps = &drvdata->caps;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 	struct coresight_device *csdev = drvdata->csdev;
 	struct csdev_access *csa = &csdev->access;
 	int i;
@@ -1074,6 +1091,8 @@ static void etm4_disable_sysfs_smp_call(void *info)
 
 	etm4_disable_hw(drvdata);
 
+	cscfg_csdev_disable_active_config(drvdata->csdev);
+
 	coresight_set_mode(drvdata->csdev, CS_MODE_DISABLED);
 }
 
@@ -1124,7 +1143,6 @@ static void etm4_disable_sysfs(struct coresight_device *csdev)
 	 * DYING hotplug callback is serviced by the ETM driver.
 	 */
 	cpus_read_lock();
-	raw_spin_lock(&drvdata->spinlock);
 
 	/*
 	 * Executing etm4_disable_hw on the cpu whose ETM is being disabled
@@ -1133,10 +1151,6 @@ static void etm4_disable_sysfs(struct coresight_device *csdev)
 	smp_call_function_single(drvdata->cpu, etm4_disable_sysfs_smp_call,
 				 drvdata, 1);
 
-	raw_spin_unlock(&drvdata->spinlock);
-
-	cscfg_csdev_disable_active_config(csdev);
-
 	cpus_read_unlock();
 
 	/*
@@ -1379,6 +1393,7 @@ static void etm4_init_arch_data(void *info)
 	struct etm4_init_arg *init_arg = info;
 	struct etmv4_drvdata *drvdata;
 	struct etmv4_caps *caps;
+	struct etmv4_config *config;
 	struct csdev_access *csa;
 	struct device *dev = init_arg->dev;
 	int i;
@@ -1386,6 +1401,7 @@ static void etm4_init_arch_data(void *info)
 	drvdata = dev_get_drvdata(init_arg->dev);
 	caps = &drvdata->caps;
 	csa = init_arg->csa;
+	config = &drvdata->active_config;
 
 	/*
 	 * If we are unable to detect the access mechanism,
@@ -1446,7 +1462,7 @@ static void etm4_init_arch_data(void *info)
 
 	/* EXLEVEL_S, bits[19:16] Secure state instruction tracing */
 	caps->s_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_S_MASK, etmidr3);
-	drvdata->config.s_ex_level = caps->s_ex_level;
+	config->s_ex_level = caps->s_ex_level;
 	/* EXLEVEL_NS, bits[23:20] Non-secure state instruction tracing */
 	caps->ns_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_NS_MASK, etmidr3);
 	/*
@@ -1692,7 +1708,7 @@ static void etm4_set_default(struct etmv4_config *config)
 static int etm4_get_next_comparator(struct etmv4_drvdata *drvdata, u32 type)
 {
 	int nr_comparator, index = 0;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 
 	/*
 	 * nr_addr_cmp holds the number of comparator _pair_, so time 2
@@ -1733,7 +1749,7 @@ static int etm4_set_event_filters(struct etmv4_drvdata *drvdata,
 {
 	int i, comparator, ret = 0;
 	u64 address;
-	struct etmv4_config *config = &drvdata->config;
+	struct etmv4_config *config = &drvdata->active_config;
 	struct etm_filters *filters = event->hw.addr_filters;
 
 	if (!filters)
@@ -1851,13 +1867,11 @@ static int etm4_starting_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
 	if (!etmdrvdata[cpu]->os_unlock)
 		etm4_os_unlock(etmdrvdata[cpu]);
 
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm4_enable_hw(etmdrvdata[cpu]);
-	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -1866,10 +1880,8 @@ static int etm4_dying_cpu(unsigned int cpu)
 	if (!etmdrvdata[cpu])
 		return 0;
 
-	raw_spin_lock(&etmdrvdata[cpu]->spinlock);
 	if (coresight_get_mode(etmdrvdata[cpu]->csdev))
 		etm4_disable_hw(etmdrvdata[cpu]);
-	raw_spin_unlock(&etmdrvdata[cpu]->spinlock);
 	return 0;
 }
 
@@ -2255,7 +2267,8 @@ static int etm4_add_coresight_dev(struct etm4_init_arg *init_arg)
 	if (!desc.name)
 		return -ENOMEM;
 
-	etm4_set_default(&drvdata->config);
+	etm4_set_default(&drvdata->active_config);
+	drvdata->config = drvdata->active_config;
 
 	pdata = coresight_get_platform_data(dev);
 	if (IS_ERR(pdata))
diff --git a/drivers/hwtracing/coresight/coresight-etm4x.h b/drivers/hwtracing/coresight/coresight-etm4x.h
index cbd8890d166a..9b50aaa368cf 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x.h
+++ b/drivers/hwtracing/coresight/coresight-etm4x.h
@@ -1069,6 +1069,7 @@ struct etmv4_save_state {
  *		allows tracing at all ELs. We don't want to compute this
  *		at runtime, due to the additional setting of TRFCR_CX when
  *		in EL2. Otherwise, 0.
+ * @active_config:	structure holding current applied configuration parameters.
  * @config:	structure holding configuration parameters.
  * @save_state:	State to be preserved across power loss
  * @paused:	Indicates if the trace unit is paused.
@@ -1089,6 +1090,7 @@ struct etmv4_drvdata {
 	bool				os_unlock : 1;
 	bool				paused : 1;
 	u64				trfcr;
+	struct etmv4_config		active_config;
 	struct etmv4_config		config;
 	struct etmv4_save_state		*save_state;
 	DECLARE_BITMAP(arch_features, ETM4_IMPDEF_FEATURE_MAX);
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 06/12] coresight: etm4x: fix leaked trace id
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

If etm4_enable_sysfs() fails in cscfg_csdev_enable_active_config(),
the trace ID may be leaked because it is not released.

To address this, call etm4_release_trace_id() when etm4_enable_sysfs()
fails in cscfg_csdev_enable_active_config().

Reviewed-by: Jie Gan <jie.gan@oss.qualcomm.com>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm4x-core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index f55338a4989d..b199aebbdb60 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -920,8 +920,10 @@ static int etm4_enable_sysfs(struct coresight_device *csdev, struct coresight_pa
 	cscfg_config_sysfs_get_active_cfg(&cfg_hash, &preset);
 	if (cfg_hash) {
 		ret = cscfg_csdev_enable_active_config(csdev, cfg_hash, preset);
-		if (ret)
+		if (ret) {
+			etm4_release_trace_id(drvdata);
 			return ret;
+		}
 	}
 
 	raw_spin_lock(&drvdata->spinlock);
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 03/12] coresight: etm4x: introduce struct etm4_caps
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

Introduce struct etm4_caps to describe ETMv4 capabilities
and move capabilities information into it.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 .../coresight/coresight-etm4x-core.c          | 234 +++++++++---------
 .../coresight/coresight-etm4x-sysfs.c         | 192 ++++++++------
 drivers/hwtracing/coresight/coresight-etm4x.h | 176 ++++++-------
 3 files changed, 329 insertions(+), 273 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index ba5b8b423bd4..b2b092a76eb5 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -88,8 +88,9 @@ static int etm4_probe_cpu(unsigned int cpu);
  */
 static bool etm4x_sspcicrn_present(struct etmv4_drvdata *drvdata, int n)
 {
-	return (n < drvdata->nr_ss_cmp) &&
-	       drvdata->nr_pe_cmp &&
+	const struct etmv4_caps *caps = &drvdata->caps;
+
+	return (n < caps->nr_ss_cmp) && caps->nr_pe_cmp &&
 	       (drvdata->config.ss_status[n] & TRCSSCSRn_PC);
 }
 
@@ -160,17 +161,20 @@ static void ete_sysreg_write(u64 val, u32 offset, bool _relaxed, bool _64bit)
 static void etm_detect_os_lock(struct etmv4_drvdata *drvdata,
 			       struct csdev_access *csa)
 {
+	struct etmv4_caps *caps = &drvdata->caps;
 	u32 oslsr = etm4x_relaxed_read32(csa, TRCOSLSR);
 
-	drvdata->os_lock_model = ETM_OSLSR_OSLM(oslsr);
+	caps->os_lock_model = ETM_OSLSR_OSLM(oslsr);
 }
 
 static void etm_write_os_lock(struct etmv4_drvdata *drvdata,
 			      struct csdev_access *csa, u32 val)
 {
+	const struct etmv4_caps *caps = &drvdata->caps;
+
 	val = !!val;
 
-	switch (drvdata->os_lock_model) {
+	switch (caps->os_lock_model) {
 	case ETM_OSLOCK_PRESENT:
 		etm4x_relaxed_write32(csa, val, TRCOSLAR);
 		break;
@@ -179,7 +183,7 @@ static void etm_write_os_lock(struct etmv4_drvdata *drvdata,
 		break;
 	default:
 		pr_warn_once("CPU%d: Unsupported Trace OSLock model: %x\n",
-			     smp_processor_id(), drvdata->os_lock_model);
+			     smp_processor_id(), caps->os_lock_model);
 		fallthrough;
 	case ETM_OSLOCK_NI:
 		return;
@@ -494,6 +498,7 @@ static int etm4_enable_trace_unit(struct etmv4_drvdata *drvdata)
 static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 {
 	int i, rc;
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 	struct coresight_device *csdev = drvdata->csdev;
 	struct device *etm_dev = &csdev->dev;
@@ -525,14 +530,14 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 	if (etm4x_wait_status(csa, TRCSTATR_IDLE_BIT, 1))
 		dev_err(etm_dev,
 			"timeout while waiting for Idle Trace Status\n");
-	if (drvdata->nr_pe)
+	if (caps->nr_pe)
 		etm4x_relaxed_write32(csa, config->pe_sel, TRCPROCSELR);
 	etm4x_relaxed_write32(csa, config->cfg, TRCCONFIGR);
 	/* nothing specific implemented */
 	etm4x_relaxed_write32(csa, 0x0, TRCAUXCTLR);
 	etm4x_relaxed_write32(csa, config->eventctrl0, TRCEVENTCTL0R);
 	etm4x_relaxed_write32(csa, config->eventctrl1, TRCEVENTCTL1R);
-	if (drvdata->stallctl)
+	if (caps->stallctl)
 		etm4x_relaxed_write32(csa, config->stall_ctrl, TRCSTALLCTLR);
 	etm4x_relaxed_write32(csa, config->ts_ctrl, TRCTSCTLR);
 	etm4x_relaxed_write32(csa, config->syncfreq, TRCSYNCPR);
@@ -542,19 +547,19 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 	etm4x_relaxed_write32(csa, config->vinst_ctrl, TRCVICTLR);
 	etm4x_relaxed_write32(csa, config->viiectlr, TRCVIIECTLR);
 	etm4x_relaxed_write32(csa, config->vissctlr, TRCVISSCTLR);
-	if (drvdata->nr_pe_cmp)
+	if (caps->nr_pe_cmp)
 		etm4x_relaxed_write32(csa, config->vipcssctlr, TRCVIPCSSCTLR);
 
-	if (drvdata->nrseqstate) {
-		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+	if (caps->nrseqstate) {
+		for (i = 0; i < caps->nrseqstate - 1; i++)
 			etm4x_relaxed_write32(csa, config->seq_ctrl[i], TRCSEQEVRn(i));
 
 		etm4x_relaxed_write32(csa, config->seq_rst, TRCSEQRSTEVR);
 		etm4x_relaxed_write32(csa, config->seq_state, TRCSEQSTR);
 	}
-	if (drvdata->numextinsel)
+	if (caps->numextinsel)
 		etm4x_relaxed_write32(csa, config->ext_inp, TRCEXTINSELR);
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		etm4x_relaxed_write32(csa, config->cntrldvr[i], TRCCNTRLDVRn(i));
 		etm4x_relaxed_write32(csa, config->cntr_ctrl[i], TRCCNTCTLRn(i));
 		etm4x_relaxed_write32(csa, config->cntr_val[i], TRCCNTVRn(i));
@@ -564,10 +569,10 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 	 * Resource selector pair 0 is always implemented and reserved.  As
 	 * such start at 2.
 	 */
-	for (i = 2; i < drvdata->nr_resource * 2; i++)
+	for (i = 2; i < caps->nr_resource * 2; i++)
 		etm4x_relaxed_write32(csa, config->res_ctrl[i], TRCRSCTLRn(i));
 
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		/* always clear status bit on restart if using single-shot */
 		if (config->ss_ctrl[i] || config->ss_pe_cmp[i])
 			config->ss_status[i] &= ~TRCSSCSRn_STATUS;
@@ -576,23 +581,23 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 		if (etm4x_sspcicrn_present(drvdata, i))
 			etm4x_relaxed_write32(csa, config->ss_pe_cmp[i], TRCSSPCICRn(i));
 	}
-	for (i = 0; i < drvdata->nr_addr_cmp * 2; i++) {
+	for (i = 0; i < caps->nr_addr_cmp * 2; i++) {
 		etm4x_relaxed_write64(csa, config->addr_val[i], TRCACVRn(i));
 		etm4x_relaxed_write64(csa, config->addr_acc[i], TRCACATRn(i));
 	}
-	for (i = 0; i < drvdata->numcidc; i++)
+	for (i = 0; i < caps->numcidc; i++)
 		etm4x_relaxed_write64(csa, config->ctxid_pid[i], TRCCIDCVRn(i));
 	etm4x_relaxed_write32(csa, config->ctxid_mask0, TRCCIDCCTLR0);
-	if (drvdata->numcidc > 4)
+	if (caps->numcidc > 4)
 		etm4x_relaxed_write32(csa, config->ctxid_mask1, TRCCIDCCTLR1);
 
-	for (i = 0; i < drvdata->numvmidc; i++)
+	for (i = 0; i < caps->numvmidc; i++)
 		etm4x_relaxed_write64(csa, config->vmid_val[i], TRCVMIDCVRn(i));
 	etm4x_relaxed_write32(csa, config->vmid_mask0, TRCVMIDCCTLR0);
-	if (drvdata->numvmidc > 4)
+	if (caps->numvmidc > 4)
 		etm4x_relaxed_write32(csa, config->vmid_mask1, TRCVMIDCCTLR1);
 
-	if (!drvdata->skip_power_up) {
+	if (!caps->skip_power_up) {
 		u32 trcpdcr = etm4x_relaxed_read32(csa, TRCPDCR);
 
 		/*
@@ -668,19 +673,20 @@ static int etm4_config_timestamp_event(struct etmv4_drvdata *drvdata,
 {
 	int ctridx;
 	int rselector;
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	/* No point in trying if we don't have at least one counter */
-	if (!drvdata->nr_cntr)
+	if (!caps->nr_cntr)
 		return -EINVAL;
 
 	/* Find a counter that hasn't been initialised */
-	for (ctridx = 0; ctridx < drvdata->nr_cntr; ctridx++)
+	for (ctridx = 0; ctridx < caps->nr_cntr; ctridx++)
 		if (config->cntr_val[ctridx] == 0)
 			break;
 
 	/* All the counters have been configured already, bail out */
-	if (ctridx == drvdata->nr_cntr) {
+	if (ctridx == caps->nr_cntr) {
 		pr_debug("%s: no available counter found\n", __func__);
 		return -ENOSPC;
 	}
@@ -696,11 +702,11 @@ static int etm4_config_timestamp_event(struct etmv4_drvdata *drvdata,
 	 * ETMIDR4 gives the number of resource selector _pairs_, hence multiply
 	 * by 2.
 	 */
-	for (rselector = 2; rselector < drvdata->nr_resource * 2; rselector++)
+	for (rselector = 2; rselector < caps->nr_resource * 2; rselector++)
 		if (!config->res_ctrl[rselector])
 			break;
 
-	if (rselector == drvdata->nr_resource * 2) {
+	if (rselector == caps->nr_resource * 2) {
 		pr_debug("%s: no available resource selector found\n",
 			 __func__);
 		return -ENOSPC;
@@ -751,6 +757,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
 {
 	int ret = 0;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 	struct perf_event_attr max_timestamp = {
 		.ATTR_CFG_FLD_timestamp_CFG = U64_MAX,
@@ -790,8 +797,8 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
 		cc_threshold = ATTR_CFG_GET_FLD(attr, cc_threshold);
 		if (!cc_threshold)
 			cc_threshold = ETM_CYC_THRESHOLD_DEFAULT;
-		if (cc_threshold < drvdata->ccitmin)
-			cc_threshold = drvdata->ccitmin;
+		if (cc_threshold < caps->ccitmin)
+			cc_threshold = caps->ccitmin;
 		config->ccctlr = cc_threshold;
 	}
 
@@ -839,7 +846,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
 	}
 
 	/* return stack - enable if selected and supported */
-	if (ATTR_CFG_GET_FLD(attr, retstack) && drvdata->retstack)
+	if (ATTR_CFG_GET_FLD(attr, retstack) && caps->retstack)
 		/* bit[12], Return stack enable bit */
 		config->cfg |= TRCCONFIGR_RS;
 
@@ -855,7 +862,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
 
 	/* branch broadcast - enable if selected and supported */
 	if (ATTR_CFG_GET_FLD(attr, branch_broadcast)) {
-		if (!drvdata->trcbb) {
+		if (!caps->trcbb) {
 			/*
 			 * Missing BB support could cause silent decode errors
 			 * so fail to open if it's not supported.
@@ -1030,6 +1037,7 @@ static void etm4_disable_trace_unit(struct etmv4_drvdata *drvdata)
 static void etm4_disable_hw(struct etmv4_drvdata *drvdata)
 {
 	u32 control;
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 	struct coresight_device *csdev = drvdata->csdev;
 	struct csdev_access *csa = &csdev->access;
@@ -1038,7 +1046,7 @@ static void etm4_disable_hw(struct etmv4_drvdata *drvdata)
 	etm4_cs_unlock(drvdata, csa);
 	etm4_disable_arch_specific(drvdata);
 
-	if (!drvdata->skip_power_up) {
+	if (!caps->skip_power_up) {
 		/* power can be removed from the trace unit now */
 		control = etm4x_relaxed_read32(csa, TRCPDCR);
 		control &= ~TRCPDCR_PU;
@@ -1048,13 +1056,13 @@ static void etm4_disable_hw(struct etmv4_drvdata *drvdata)
 	etm4_disable_trace_unit(drvdata);
 
 	/* read the status of the single shot comparators */
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		config->ss_status[i] =
 			etm4x_relaxed_read32(csa, TRCSSCSRn(i));
 	}
 
 	/* read back the current counter values */
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		config->cntr_val[i] =
 			etm4x_relaxed_read32(csa, TRCCNTVRn(i));
 	}
@@ -1352,7 +1360,7 @@ static struct midr_range etm_wrong_ccitmin_cpus[] = {
 	{},
 };
 
-static void etm4_fixup_wrong_ccitmin(struct etmv4_drvdata *drvdata)
+static void etm4_fixup_wrong_ccitmin(struct etmv4_caps *caps)
 {
 	/*
 	 * Erratum affected cpus will read 256 as the minimum
@@ -1362,8 +1370,8 @@ static void etm4_fixup_wrong_ccitmin(struct etmv4_drvdata *drvdata)
 	 * this problem.
 	 */
 	if (is_midr_in_range_list(etm_wrong_ccitmin_cpus)) {
-		if (drvdata->ccitmin == 256)
-			drvdata->ccitmin = 4;
+		if (caps->ccitmin == 256)
+			caps->ccitmin = 4;
 	}
 }
 
@@ -1376,11 +1384,13 @@ static void etm4_init_arch_data(void *info)
 	u32 etmidr5;
 	struct etm4_init_arg *init_arg = info;
 	struct etmv4_drvdata *drvdata;
+	struct etmv4_caps *caps;
 	struct csdev_access *csa;
 	struct device *dev = init_arg->dev;
 	int i;
 
 	drvdata = dev_get_drvdata(init_arg->dev);
+	caps = &drvdata->caps;
 	csa = init_arg->csa;
 
 	/*
@@ -1393,7 +1403,7 @@ static void etm4_init_arch_data(void *info)
 
 	if (!csa->io_mem ||
 	    fwnode_property_present(dev_fwnode(dev), "qcom,skip-power-up"))
-		drvdata->skip_power_up = true;
+		caps->skip_power_up = true;
 
 	/* Detect the support for OS Lock before we actually use it */
 	etm_detect_os_lock(drvdata, csa);
@@ -1408,71 +1418,71 @@ static void etm4_init_arch_data(void *info)
 	etmidr0 = etm4x_relaxed_read32(csa, TRCIDR0);
 
 	/* INSTP0, bits[2:1] P0 tracing support field */
-	drvdata->instrp0 = !!(FIELD_GET(TRCIDR0_INSTP0_MASK, etmidr0) == 0b11);
+	caps->instrp0 = !!(FIELD_GET(TRCIDR0_INSTP0_MASK, etmidr0) == 0b11);
 	/* TRCBB, bit[5] Branch broadcast tracing support bit */
-	drvdata->trcbb = !!(etmidr0 & TRCIDR0_TRCBB);
+	caps->trcbb = !!(etmidr0 & TRCIDR0_TRCBB);
 	/* TRCCOND, bit[6] Conditional instruction tracing support bit */
-	drvdata->trccond = !!(etmidr0 & TRCIDR0_TRCCOND);
+	caps->trccond = !!(etmidr0 & TRCIDR0_TRCCOND);
 	/* TRCCCI, bit[7] Cycle counting instruction bit */
-	drvdata->trccci = !!(etmidr0 & TRCIDR0_TRCCCI);
+	caps->trccci = !!(etmidr0 & TRCIDR0_TRCCCI);
 	/* RETSTACK, bit[9] Return stack bit */
-	drvdata->retstack = !!(etmidr0 & TRCIDR0_RETSTACK);
+	caps->retstack = !!(etmidr0 & TRCIDR0_RETSTACK);
 	/* NUMEVENT, bits[11:10] Number of events field */
-	drvdata->nr_event = FIELD_GET(TRCIDR0_NUMEVENT_MASK, etmidr0);
+	caps->nr_event = FIELD_GET(TRCIDR0_NUMEVENT_MASK, etmidr0);
 	/* QSUPP, bits[16:15] Q element support field */
-	drvdata->q_support = FIELD_GET(TRCIDR0_QSUPP_MASK, etmidr0);
-	if (drvdata->q_support)
-		drvdata->q_filt = !!(etmidr0 & TRCIDR0_QFILT);
+	caps->q_support = FIELD_GET(TRCIDR0_QSUPP_MASK, etmidr0);
+	if (caps->q_support)
+		caps->q_filt = !!(etmidr0 & TRCIDR0_QFILT);
 	/* TSSIZE, bits[28:24] Global timestamp size field */
-	drvdata->ts_size = FIELD_GET(TRCIDR0_TSSIZE_MASK, etmidr0);
+	caps->ts_size = FIELD_GET(TRCIDR0_TSSIZE_MASK, etmidr0);
 
 	/* maximum size of resources */
 	etmidr2 = etm4x_relaxed_read32(csa, TRCIDR2);
 	/* CIDSIZE, bits[9:5] Indicates the Context ID size */
-	drvdata->ctxid_size = FIELD_GET(TRCIDR2_CIDSIZE_MASK, etmidr2);
+	caps->ctxid_size = FIELD_GET(TRCIDR2_CIDSIZE_MASK, etmidr2);
 	/* VMIDSIZE, bits[14:10] Indicates the VMID size */
-	drvdata->vmid_size = FIELD_GET(TRCIDR2_VMIDSIZE_MASK, etmidr2);
+	caps->vmid_size = FIELD_GET(TRCIDR2_VMIDSIZE_MASK, etmidr2);
 	/* CCSIZE, bits[28:25] size of the cycle counter in bits minus 12 */
-	drvdata->ccsize = FIELD_GET(TRCIDR2_CCSIZE_MASK, etmidr2);
+	caps->ccsize = FIELD_GET(TRCIDR2_CCSIZE_MASK, etmidr2);
 
 	etmidr3 = etm4x_relaxed_read32(csa, TRCIDR3);
 	/* CCITMIN, bits[11:0] minimum threshold value that can be programmed */
-	drvdata->ccitmin = FIELD_GET(TRCIDR3_CCITMIN_MASK, etmidr3);
-	etm4_fixup_wrong_ccitmin(drvdata);
+	caps->ccitmin = FIELD_GET(TRCIDR3_CCITMIN_MASK, etmidr3);
+	etm4_fixup_wrong_ccitmin(caps);
 
 	/* EXLEVEL_S, bits[19:16] Secure state instruction tracing */
-	drvdata->s_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_S_MASK, etmidr3);
-	drvdata->config.s_ex_level = drvdata->s_ex_level;
+	caps->s_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_S_MASK, etmidr3);
+	drvdata->config.s_ex_level = caps->s_ex_level;
 	/* EXLEVEL_NS, bits[23:20] Non-secure state instruction tracing */
-	drvdata->ns_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_NS_MASK, etmidr3);
+	caps->ns_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_NS_MASK, etmidr3);
 	/*
 	 * TRCERR, bit[24] whether a trace unit can trace a
 	 * system error exception.
 	 */
-	drvdata->trc_error = !!(etmidr3 & TRCIDR3_TRCERR);
+	caps->trc_error = !!(etmidr3 & TRCIDR3_TRCERR);
 	/* SYNCPR, bit[25] implementation has a fixed synchronization period? */
-	drvdata->syncpr = !!(etmidr3 & TRCIDR3_SYNCPR);
+	caps->syncpr = !!(etmidr3 & TRCIDR3_SYNCPR);
 	/* STALLCTL, bit[26] is stall control implemented? */
-	drvdata->stallctl = !!(etmidr3 & TRCIDR3_STALLCTL);
+	caps->stallctl = !!(etmidr3 & TRCIDR3_STALLCTL);
 	/* SYSSTALL, bit[27] implementation can support stall control? */
-	drvdata->sysstall = !!(etmidr3 & TRCIDR3_SYSSTALL);
+	caps->sysstall = !!(etmidr3 & TRCIDR3_SYSSTALL);
 	/*
 	 * NUMPROC - the number of PEs available for tracing, 5bits
 	 *         = TRCIDR3.bits[13:12]bits[30:28]
 	 *  bits[4:3] = TRCIDR3.bits[13:12] (since etm-v4.2, otherwise RES0)
 	 *  bits[3:0] = TRCIDR3.bits[30:28]
 	 */
-	drvdata->nr_pe =  (FIELD_GET(TRCIDR3_NUMPROC_HI_MASK, etmidr3) << 3) |
-			   FIELD_GET(TRCIDR3_NUMPROC_LO_MASK, etmidr3);
+	caps->nr_pe = (FIELD_GET(TRCIDR3_NUMPROC_HI_MASK, etmidr3) << 3) |
+			       FIELD_GET(TRCIDR3_NUMPROC_LO_MASK, etmidr3);
 	/* NOOVERFLOW, bit[31] is trace overflow prevention supported */
-	drvdata->nooverflow = !!(etmidr3 & TRCIDR3_NOOVERFLOW);
+	caps->nooverflow = !!(etmidr3 & TRCIDR3_NOOVERFLOW);
 
 	/* number of resources trace unit supports */
 	etmidr4 = etm4x_relaxed_read32(csa, TRCIDR4);
 	/* NUMACPAIRS, bits[0:3] number of addr comparator pairs for tracing */
-	drvdata->nr_addr_cmp = FIELD_GET(TRCIDR4_NUMACPAIRS_MASK, etmidr4);
+	caps->nr_addr_cmp = FIELD_GET(TRCIDR4_NUMACPAIRS_MASK, etmidr4);
 	/* NUMPC, bits[15:12] number of PE comparator inputs for tracing */
-	drvdata->nr_pe_cmp = FIELD_GET(TRCIDR4_NUMPC_MASK, etmidr4);
+	caps->nr_pe_cmp = FIELD_GET(TRCIDR4_NUMPC_MASK, etmidr4);
 	/*
 	 * NUMRSPAIR, bits[19:16]
 	 * The number of resource pairs conveyed by the HW starts at 0, i.e a
@@ -1483,41 +1493,41 @@ static void etm4_init_arch_data(void *info)
 	 * the default TRUE and FALSE resource selectors are omitted.
 	 * Otherwise for values 0x1 and above the number is N + 1 as per v4.2.
 	 */
-	drvdata->nr_resource = FIELD_GET(TRCIDR4_NUMRSPAIR_MASK, etmidr4);
-	if ((drvdata->arch < ETM_ARCH_V4_3) || (drvdata->nr_resource > 0))
-		drvdata->nr_resource += 1;
+	caps->nr_resource = FIELD_GET(TRCIDR4_NUMRSPAIR_MASK, etmidr4);
+	if ((drvdata->arch < ETM_ARCH_V4_3) || (caps->nr_resource > 0))
+		caps->nr_resource += 1;
 	/*
 	 * NUMSSCC, bits[23:20] the number of single-shot
 	 * comparator control for tracing. Read any status regs as these
 	 * also contain RO capability data.
 	 */
-	drvdata->nr_ss_cmp = FIELD_GET(TRCIDR4_NUMSSCC_MASK, etmidr4);
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	caps->nr_ss_cmp = FIELD_GET(TRCIDR4_NUMSSCC_MASK, etmidr4);
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		drvdata->config.ss_status[i] =
 			etm4x_relaxed_read32(csa, TRCSSCSRn(i));
 	}
 	/* NUMCIDC, bits[27:24] number of Context ID comparators for tracing */
-	drvdata->numcidc = FIELD_GET(TRCIDR4_NUMCIDC_MASK, etmidr4);
+	caps->numcidc = FIELD_GET(TRCIDR4_NUMCIDC_MASK, etmidr4);
 	/* NUMVMIDC, bits[31:28] number of VMID comparators for tracing */
-	drvdata->numvmidc = FIELD_GET(TRCIDR4_NUMVMIDC_MASK, etmidr4);
+	caps->numvmidc = FIELD_GET(TRCIDR4_NUMVMIDC_MASK, etmidr4);
 
 	etmidr5 = etm4x_relaxed_read32(csa, TRCIDR5);
 	/* NUMEXTIN, bits[8:0] number of external inputs implemented */
-	drvdata->nr_ext_inp = FIELD_GET(TRCIDR5_NUMEXTIN_MASK, etmidr5);
-	drvdata->numextinsel = FIELD_GET(TRCIDR5_NUMEXTINSEL_MASK, etmidr5);
+	caps->nr_ext_inp = FIELD_GET(TRCIDR5_NUMEXTIN_MASK, etmidr5);
+	caps->numextinsel = FIELD_GET(TRCIDR5_NUMEXTINSEL_MASK, etmidr5);
 	/* TRACEIDSIZE, bits[21:16] indicates the trace ID width */
-	drvdata->trcid_size = FIELD_GET(TRCIDR5_TRACEIDSIZE_MASK, etmidr5);
+	caps->trcid_size = FIELD_GET(TRCIDR5_TRACEIDSIZE_MASK, etmidr5);
 	/* ATBTRIG, bit[22] implementation can support ATB triggers? */
-	drvdata->atbtrig = !!(etmidr5 & TRCIDR5_ATBTRIG);
+	caps->atbtrig = !!(etmidr5 & TRCIDR5_ATBTRIG);
 	/*
 	 * LPOVERRIDE, bit[23] implementation supports
 	 * low-power state override
 	 */
-	drvdata->lpoverride = (etmidr5 & TRCIDR5_LPOVERRIDE) && (!drvdata->skip_power_up);
+	caps->lpoverride = (etmidr5 & TRCIDR5_LPOVERRIDE) && (!caps->skip_power_up);
 	/* NUMSEQSTATE, bits[27:25] number of sequencer states implemented */
-	drvdata->nrseqstate = FIELD_GET(TRCIDR5_NUMSEQSTATE_MASK, etmidr5);
+	caps->nrseqstate = FIELD_GET(TRCIDR5_NUMSEQSTATE_MASK, etmidr5);
 	/* NUMCNTR, bits[30:28] number of counters available for tracing */
-	drvdata->nr_cntr = FIELD_GET(TRCIDR5_NUMCNTR_MASK, etmidr5);
+	caps->nr_cntr = FIELD_GET(TRCIDR5_NUMCNTR_MASK, etmidr5);
 
 	coresight_clear_self_claim_tag_unlocked(csa);
 	etm4_cs_lock(drvdata, csa);
@@ -1693,7 +1703,7 @@ static int etm4_get_next_comparator(struct etmv4_drvdata *drvdata, u32 type)
 	 * nr_addr_cmp holds the number of comparator _pair_, so time 2
 	 * for the total number of comparators.
 	 */
-	nr_comparator = drvdata->nr_addr_cmp * 2;
+	nr_comparator = drvdata->caps.nr_addr_cmp * 2;
 
 	/* Go through the tally of comparators looking for a free one. */
 	while (index < nr_comparator) {
@@ -1871,6 +1881,7 @@ static int etm4_dying_cpu(unsigned int cpu)
 static int __etm4_cpu_save(struct etmv4_drvdata *drvdata)
 {
 	int i, ret = 0;
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_save_state *state;
 	struct coresight_device *csdev = drvdata->csdev;
 	struct csdev_access *csa;
@@ -1907,57 +1918,57 @@ static int __etm4_cpu_save(struct etmv4_drvdata *drvdata)
 
 	state = drvdata->save_state;
 
-	if (drvdata->nr_pe)
+	if (caps->nr_pe)
 		state->trcprocselr = etm4x_read32(csa, TRCPROCSELR);
 	state->trcconfigr = etm4x_read32(csa, TRCCONFIGR);
 	state->trcauxctlr = etm4x_read32(csa, TRCAUXCTLR);
 	state->trceventctl0r = etm4x_read32(csa, TRCEVENTCTL0R);
 	state->trceventctl1r = etm4x_read32(csa, TRCEVENTCTL1R);
-	if (drvdata->stallctl)
+	if (caps->stallctl)
 		state->trcstallctlr = etm4x_read32(csa, TRCSTALLCTLR);
 	state->trctsctlr = etm4x_read32(csa, TRCTSCTLR);
 	state->trcsyncpr = etm4x_read32(csa, TRCSYNCPR);
 	state->trcccctlr = etm4x_read32(csa, TRCCCCTLR);
 	state->trcbbctlr = etm4x_read32(csa, TRCBBCTLR);
 	state->trctraceidr = etm4x_read32(csa, TRCTRACEIDR);
-	if (drvdata->q_filt)
+	if (caps->q_filt)
 		state->trcqctlr = etm4x_read32(csa, TRCQCTLR);
 
 	state->trcvictlr = etm4x_read32(csa, TRCVICTLR);
 	state->trcviiectlr = etm4x_read32(csa, TRCVIIECTLR);
 	state->trcvissctlr = etm4x_read32(csa, TRCVISSCTLR);
-	if (drvdata->nr_pe_cmp)
+	if (caps->nr_pe_cmp)
 		state->trcvipcssctlr = etm4x_read32(csa, TRCVIPCSSCTLR);
 
-	if (drvdata->nrseqstate) {
-		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+	if (caps->nrseqstate) {
+		for (i = 0; i < caps->nrseqstate - 1; i++)
 			state->trcseqevr[i] = etm4x_read32(csa, TRCSEQEVRn(i));
 
 		state->trcseqrstevr = etm4x_read32(csa, TRCSEQRSTEVR);
 		state->trcseqstr = etm4x_read32(csa, TRCSEQSTR);
 	}
 
-	if (drvdata->numextinsel)
+	if (caps->numextinsel)
 		state->trcextinselr = etm4x_read32(csa, TRCEXTINSELR);
 
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		state->trccntrldvr[i] = etm4x_read32(csa, TRCCNTRLDVRn(i));
 		state->trccntctlr[i] = etm4x_read32(csa, TRCCNTCTLRn(i));
 		state->trccntvr[i] = etm4x_read32(csa, TRCCNTVRn(i));
 	}
 
 	/* Resource selector pair 0 is reserved */
-	for (i = 2; i < drvdata->nr_resource * 2; i++)
+	for (i = 2; i < caps->nr_resource * 2; i++)
 		state->trcrsctlr[i] = etm4x_read32(csa, TRCRSCTLRn(i));
 
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		state->trcssccr[i] = etm4x_read32(csa, TRCSSCCRn(i));
 		state->trcsscsr[i] = etm4x_read32(csa, TRCSSCSRn(i));
 		if (etm4x_sspcicrn_present(drvdata, i))
 			state->trcsspcicr[i] = etm4x_read32(csa, TRCSSPCICRn(i));
 	}
 
-	for (i = 0; i < drvdata->nr_addr_cmp * 2; i++) {
+	for (i = 0; i < caps->nr_addr_cmp * 2; i++) {
 		state->trcacvr[i] = etm4x_read64(csa, TRCACVRn(i));
 		state->trcacatr[i] = etm4x_read64(csa, TRCACATRn(i));
 	}
@@ -1969,23 +1980,23 @@ static int __etm4_cpu_save(struct etmv4_drvdata *drvdata)
 	 * unit") of ARM IHI 0064D.
 	 */
 
-	for (i = 0; i < drvdata->numcidc; i++)
+	for (i = 0; i < caps->numcidc; i++)
 		state->trccidcvr[i] = etm4x_read64(csa, TRCCIDCVRn(i));
 
-	for (i = 0; i < drvdata->numvmidc; i++)
+	for (i = 0; i < caps->numvmidc; i++)
 		state->trcvmidcvr[i] = etm4x_read64(csa, TRCVMIDCVRn(i));
 
 	state->trccidcctlr0 = etm4x_read32(csa, TRCCIDCCTLR0);
-	if (drvdata->numcidc > 4)
+	if (caps->numcidc > 4)
 		state->trccidcctlr1 = etm4x_read32(csa, TRCCIDCCTLR1);
 
 	state->trcvmidcctlr0 = etm4x_read32(csa, TRCVMIDCCTLR0);
-	if (drvdata->numvmidc > 4)
+	if (caps->numvmidc > 4)
 		state->trcvmidcctlr0 = etm4x_read32(csa, TRCVMIDCCTLR1);
 
 	state->trcclaimset = etm4x_read32(csa, TRCCLAIMCLR);
 
-	if (!drvdata->skip_power_up)
+	if (!caps->skip_power_up)
 		state->trcpdcr = etm4x_read32(csa, TRCPDCR);
 
 	/* wait for TRCSTATR.IDLE to go up */
@@ -2002,7 +2013,7 @@ static int __etm4_cpu_save(struct etmv4_drvdata *drvdata)
 	 * potentially save power on systems that respect the TRCPDCR_PU
 	 * despite requesting software to save/restore state.
 	 */
-	if (!drvdata->skip_power_up)
+	if (!caps->skip_power_up)
 		etm4x_relaxed_write32(csa, (state->trcpdcr & ~TRCPDCR_PU),
 				      TRCPDCR);
 out:
@@ -2029,6 +2040,7 @@ static int etm4_cpu_save(struct etmv4_drvdata *drvdata)
 static void __etm4_cpu_restore(struct etmv4_drvdata *drvdata)
 {
 	int i;
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_save_state *state = drvdata->save_state;
 	struct csdev_access *csa = &drvdata->csdev->access;
 
@@ -2038,77 +2050,77 @@ static void __etm4_cpu_restore(struct etmv4_drvdata *drvdata)
 	etm4_cs_unlock(drvdata, csa);
 	etm4x_relaxed_write32(csa, state->trcclaimset, TRCCLAIMSET);
 
-	if (drvdata->nr_pe)
+	if (caps->nr_pe)
 		etm4x_relaxed_write32(csa, state->trcprocselr, TRCPROCSELR);
 	etm4x_relaxed_write32(csa, state->trcconfigr, TRCCONFIGR);
 	etm4x_relaxed_write32(csa, state->trcauxctlr, TRCAUXCTLR);
 	etm4x_relaxed_write32(csa, state->trceventctl0r, TRCEVENTCTL0R);
 	etm4x_relaxed_write32(csa, state->trceventctl1r, TRCEVENTCTL1R);
-	if (drvdata->stallctl)
+	if (caps->stallctl)
 		etm4x_relaxed_write32(csa, state->trcstallctlr, TRCSTALLCTLR);
 	etm4x_relaxed_write32(csa, state->trctsctlr, TRCTSCTLR);
 	etm4x_relaxed_write32(csa, state->trcsyncpr, TRCSYNCPR);
 	etm4x_relaxed_write32(csa, state->trcccctlr, TRCCCCTLR);
 	etm4x_relaxed_write32(csa, state->trcbbctlr, TRCBBCTLR);
 	etm4x_relaxed_write32(csa, state->trctraceidr, TRCTRACEIDR);
-	if (drvdata->q_filt)
+	if (caps->q_filt)
 		etm4x_relaxed_write32(csa, state->trcqctlr, TRCQCTLR);
 
 	etm4x_relaxed_write32(csa, state->trcvictlr, TRCVICTLR);
 	etm4x_relaxed_write32(csa, state->trcviiectlr, TRCVIIECTLR);
 	etm4x_relaxed_write32(csa, state->trcvissctlr, TRCVISSCTLR);
-	if (drvdata->nr_pe_cmp)
+	if (caps->nr_pe_cmp)
 		etm4x_relaxed_write32(csa, state->trcvipcssctlr, TRCVIPCSSCTLR);
 
-	if (drvdata->nrseqstate) {
-		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+	if (caps->nrseqstate) {
+		for (i = 0; i < caps->nrseqstate - 1; i++)
 			etm4x_relaxed_write32(csa, state->trcseqevr[i], TRCSEQEVRn(i));
 
 		etm4x_relaxed_write32(csa, state->trcseqrstevr, TRCSEQRSTEVR);
 		etm4x_relaxed_write32(csa, state->trcseqstr, TRCSEQSTR);
 	}
-	if (drvdata->numextinsel)
+	if (caps->numextinsel)
 		etm4x_relaxed_write32(csa, state->trcextinselr, TRCEXTINSELR);
 
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		etm4x_relaxed_write32(csa, state->trccntrldvr[i], TRCCNTRLDVRn(i));
 		etm4x_relaxed_write32(csa, state->trccntctlr[i], TRCCNTCTLRn(i));
 		etm4x_relaxed_write32(csa, state->trccntvr[i], TRCCNTVRn(i));
 	}
 
 	/* Resource selector pair 0 is reserved */
-	for (i = 2; i < drvdata->nr_resource * 2; i++)
+	for (i = 2; i < caps->nr_resource * 2; i++)
 		etm4x_relaxed_write32(csa, state->trcrsctlr[i], TRCRSCTLRn(i));
 
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		etm4x_relaxed_write32(csa, state->trcssccr[i], TRCSSCCRn(i));
 		etm4x_relaxed_write32(csa, state->trcsscsr[i], TRCSSCSRn(i));
 		if (etm4x_sspcicrn_present(drvdata, i))
 			etm4x_relaxed_write32(csa, state->trcsspcicr[i], TRCSSPCICRn(i));
 	}
 
-	for (i = 0; i < drvdata->nr_addr_cmp * 2; i++) {
+	for (i = 0; i < caps->nr_addr_cmp * 2; i++) {
 		etm4x_relaxed_write64(csa, state->trcacvr[i], TRCACVRn(i));
 		etm4x_relaxed_write64(csa, state->trcacatr[i], TRCACATRn(i));
 	}
 
-	for (i = 0; i < drvdata->numcidc; i++)
+	for (i = 0; i < caps->numcidc; i++)
 		etm4x_relaxed_write64(csa, state->trccidcvr[i], TRCCIDCVRn(i));
 
-	for (i = 0; i < drvdata->numvmidc; i++)
+	for (i = 0; i < caps->numvmidc; i++)
 		etm4x_relaxed_write64(csa, state->trcvmidcvr[i], TRCVMIDCVRn(i));
 
 	etm4x_relaxed_write32(csa, state->trccidcctlr0, TRCCIDCCTLR0);
-	if (drvdata->numcidc > 4)
+	if (caps->numcidc > 4)
 		etm4x_relaxed_write32(csa, state->trccidcctlr1, TRCCIDCCTLR1);
 
 	etm4x_relaxed_write32(csa, state->trcvmidcctlr0, TRCVMIDCCTLR0);
-	if (drvdata->numvmidc > 4)
+	if (caps->numvmidc > 4)
 		etm4x_relaxed_write32(csa, state->trcvmidcctlr0, TRCVMIDCCTLR1);
 
 	etm4x_relaxed_write32(csa, state->trcclaimset, TRCCLAIMSET);
 
-	if (!drvdata->skip_power_up)
+	if (!caps->skip_power_up)
 		etm4x_relaxed_write32(csa, state->trcpdcr, TRCPDCR);
 
 	/*
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
index d866dcfa2a36..8bd28e71d4c9 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
@@ -62,8 +62,9 @@ static ssize_t nr_pe_cmp_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_pe_cmp;
+	val = caps->nr_pe_cmp;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_pe_cmp);
@@ -74,8 +75,9 @@ static ssize_t nr_addr_cmp_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_addr_cmp;
+	val = caps->nr_addr_cmp;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_addr_cmp);
@@ -86,8 +88,9 @@ static ssize_t nr_cntr_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_cntr;
+	val = caps->nr_cntr;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_cntr);
@@ -98,8 +101,9 @@ static ssize_t nr_ext_inp_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_ext_inp;
+	val = caps->nr_ext_inp;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_ext_inp);
@@ -110,8 +114,9 @@ static ssize_t numcidc_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->numcidc;
+	val = caps->numcidc;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(numcidc);
@@ -122,8 +127,9 @@ static ssize_t numvmidc_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->numvmidc;
+	val = caps->numvmidc;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(numvmidc);
@@ -134,8 +140,9 @@ static ssize_t nrseqstate_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nrseqstate;
+	val = caps->nrseqstate;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nrseqstate);
@@ -146,8 +153,9 @@ static ssize_t nr_resource_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_resource;
+	val = caps->nr_resource;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_resource);
@@ -158,8 +166,9 @@ static ssize_t nr_ss_cmp_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 
-	val = drvdata->nr_ss_cmp;
+	val = caps->nr_ss_cmp;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
 static DEVICE_ATTR_RO(nr_ss_cmp);
@@ -171,6 +180,7 @@ static ssize_t reset_store(struct device *dev,
 	int i;
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -200,7 +210,7 @@ static ssize_t reset_store(struct device *dev,
 	config->stall_ctrl = 0x0;
 
 	/* Reset trace synchronization period  to 2^8 = 256 bytes*/
-	if (drvdata->syncpr == false)
+	if (!caps->syncpr)
 		config->syncfreq = 0x8;
 
 	/*
@@ -209,7 +219,7 @@ static ssize_t reset_store(struct device *dev,
 	 * each trace run.
 	 */
 	config->vinst_ctrl = FIELD_PREP(TRCVICTLR_EVENT_MASK, 0x01);
-	if (drvdata->nr_addr_cmp > 0) {
+	if (caps->nr_addr_cmp > 0) {
 		config->mode |= ETM_MODE_VIEWINST_STARTSTOP;
 		/* SSSTATUS, bit[9] */
 		config->vinst_ctrl |= TRCVICTLR_SSSTATUS;
@@ -223,7 +233,7 @@ static ssize_t reset_store(struct device *dev,
 	config->vipcssctlr = 0x0;
 
 	/* Disable seq events */
-	for (i = 0; i < drvdata->nrseqstate-1; i++)
+	for (i = 0; i < caps->nrseqstate - 1; i++)
 		config->seq_ctrl[i] = 0x0;
 	config->seq_rst = 0x0;
 	config->seq_state = 0x0;
@@ -232,38 +242,38 @@ static ssize_t reset_store(struct device *dev,
 	config->ext_inp = 0x0;
 
 	config->cntr_idx = 0x0;
-	for (i = 0; i < drvdata->nr_cntr; i++) {
+	for (i = 0; i < caps->nr_cntr; i++) {
 		config->cntrldvr[i] = 0x0;
 		config->cntr_ctrl[i] = 0x0;
 		config->cntr_val[i] = 0x0;
 	}
 
 	config->res_idx = 0x0;
-	for (i = 2; i < 2 * drvdata->nr_resource; i++)
+	for (i = 2; i < 2 * caps->nr_resource; i++)
 		config->res_ctrl[i] = 0x0;
 
 	config->ss_idx = 0x0;
-	for (i = 0; i < drvdata->nr_ss_cmp; i++) {
+	for (i = 0; i < caps->nr_ss_cmp; i++) {
 		config->ss_ctrl[i] = 0x0;
 		config->ss_pe_cmp[i] = 0x0;
 	}
 
 	config->addr_idx = 0x0;
-	for (i = 0; i < drvdata->nr_addr_cmp * 2; i++) {
+	for (i = 0; i < caps->nr_addr_cmp * 2; i++) {
 		config->addr_val[i] = 0x0;
 		config->addr_acc[i] = 0x0;
 		config->addr_type[i] = ETM_ADDR_TYPE_NONE;
 	}
 
 	config->ctxid_idx = 0x0;
-	for (i = 0; i < drvdata->numcidc; i++)
+	for (i = 0; i < caps->numcidc; i++)
 		config->ctxid_pid[i] = 0x0;
 
 	config->ctxid_mask0 = 0x0;
 	config->ctxid_mask1 = 0x0;
 
 	config->vmid_idx = 0x0;
-	for (i = 0; i < drvdata->numvmidc; i++)
+	for (i = 0; i < caps->numvmidc; i++)
 		config->vmid_val[i] = 0x0;
 	config->vmid_mask0 = 0x0;
 	config->vmid_mask1 = 0x0;
@@ -297,6 +307,7 @@ static ssize_t mode_store(struct device *dev,
 {
 	unsigned long val, mode;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -305,7 +316,7 @@ static ssize_t mode_store(struct device *dev,
 	raw_spin_lock(&drvdata->spinlock);
 	config->mode = val & ETMv4_MODE_ALL;
 
-	if (drvdata->instrp0 == true) {
+	if (caps->instrp0) {
 		/* start by clearing instruction P0 field */
 		config->cfg  &= ~TRCCONFIGR_INSTP0_LOAD_STORE;
 		if (config->mode & ETM_MODE_LOAD)
@@ -323,45 +334,44 @@ static ssize_t mode_store(struct device *dev,
 	}
 
 	/* bit[3], Branch broadcast mode */
-	if ((config->mode & ETM_MODE_BB) && (drvdata->trcbb == true))
+	if ((config->mode & ETM_MODE_BB) && (caps->trcbb))
 		config->cfg |= TRCCONFIGR_BB;
 	else
 		config->cfg &= ~TRCCONFIGR_BB;
 
 	/* bit[4], Cycle counting instruction trace bit */
 	if ((config->mode & ETMv4_MODE_CYCACC) &&
-		(drvdata->trccci == true))
+		(caps->trccci == true))
 		config->cfg |= TRCCONFIGR_CCI;
 	else
 		config->cfg &= ~TRCCONFIGR_CCI;
 
 	/* bit[6], Context ID tracing bit */
-	if ((config->mode & ETMv4_MODE_CTXID) && (drvdata->ctxid_size))
+	if ((config->mode & ETMv4_MODE_CTXID) && (caps->ctxid_size))
 		config->cfg |= TRCCONFIGR_CID;
 	else
 		config->cfg &= ~TRCCONFIGR_CID;
 
-	if ((config->mode & ETM_MODE_VMID) && (drvdata->vmid_size))
+	if ((config->mode & ETM_MODE_VMID) && (caps->vmid_size))
 		config->cfg |= TRCCONFIGR_VMID;
 	else
 		config->cfg &= ~TRCCONFIGR_VMID;
 
 	/* bits[10:8], Conditional instruction tracing bit */
 	mode = ETM_MODE_COND(config->mode);
-	if (drvdata->trccond == true) {
+	if (caps->trccond) {
 		config->cfg &= ~TRCCONFIGR_COND_MASK;
 		config->cfg |= mode << __bf_shf(TRCCONFIGR_COND_MASK);
 	}
 
 	/* bit[11], Global timestamp tracing bit */
-	if ((config->mode & ETMv4_MODE_TIMESTAMP) && (drvdata->ts_size))
+	if ((config->mode & ETMv4_MODE_TIMESTAMP) && (caps->ts_size))
 		config->cfg |= TRCCONFIGR_TS;
 	else
 		config->cfg &= ~TRCCONFIGR_TS;
 
 	/* bit[12], Return stack enable bit */
-	if ((config->mode & ETM_MODE_RETURNSTACK) &&
-					(drvdata->retstack == true))
+	if ((config->mode & ETM_MODE_RETURNSTACK) && (caps->retstack))
 		config->cfg |= TRCCONFIGR_RS;
 	else
 		config->cfg &= ~TRCCONFIGR_RS;
@@ -375,31 +385,29 @@ static ssize_t mode_store(struct device *dev,
 	 * Always set the low bit for any requested mode. Valid combos are
 	 * 0b00, 0b01 and 0b11.
 	 */
-	if (mode && drvdata->q_support)
+	if (mode && caps->q_support)
 		config->cfg |= TRCCONFIGR_QE_W_COUNTS;
 	/*
 	 * if supported, Q elements with and without instruction
 	 * counts are enabled
 	 */
-	if ((mode & BIT(1)) && (drvdata->q_support & BIT(1)))
+	if ((mode & BIT(1)) && (caps->q_support & BIT(1)))
 		config->cfg |= TRCCONFIGR_QE_WO_COUNTS;
 
 	/* bit[11], AMBA Trace Bus (ATB) trigger enable bit */
-	if ((config->mode & ETM_MODE_ATB_TRIGGER) &&
-	    (drvdata->atbtrig == true))
+	if ((config->mode & ETM_MODE_ATB_TRIGGER) && (caps->atbtrig))
 		config->eventctrl1 |= TRCEVENTCTL1R_ATB;
 	else
 		config->eventctrl1 &= ~TRCEVENTCTL1R_ATB;
 
 	/* bit[12], Low-power state behavior override bit */
-	if ((config->mode & ETM_MODE_LPOVERRIDE) &&
-	    (drvdata->lpoverride == true))
+	if ((config->mode & ETM_MODE_LPOVERRIDE) && (caps->lpoverride))
 		config->eventctrl1 |= TRCEVENTCTL1R_LPOVERRIDE;
 	else
 		config->eventctrl1 &= ~TRCEVENTCTL1R_LPOVERRIDE;
 
 	/* bit[8], Instruction stall bit */
-	if ((config->mode & ETM_MODE_ISTALL_EN) && (drvdata->stallctl == true))
+	if ((config->mode & ETM_MODE_ISTALL_EN) && (caps->stallctl))
 		config->stall_ctrl |= TRCSTALLCTLR_ISTALL;
 	else
 		config->stall_ctrl &= ~TRCSTALLCTLR_ISTALL;
@@ -411,8 +419,7 @@ static ssize_t mode_store(struct device *dev,
 		config->stall_ctrl &= ~TRCSTALLCTLR_INSTPRIORITY;
 
 	/* bit[13], Trace overflow prevention bit */
-	if ((config->mode & ETM_MODE_NOOVERFLOW) &&
-		(drvdata->nooverflow == true))
+	if ((config->mode & ETM_MODE_NOOVERFLOW) && (caps->nooverflow))
 		config->stall_ctrl |= TRCSTALLCTLR_NOOVERFLOW;
 	else
 		config->stall_ctrl &= ~TRCSTALLCTLR_NOOVERFLOW;
@@ -430,8 +437,7 @@ static ssize_t mode_store(struct device *dev,
 		config->vinst_ctrl &= ~TRCVICTLR_TRCRESET;
 
 	/* bit[11], Whether a trace unit must trace a system error exception */
-	if ((config->mode & ETM_MODE_TRACE_ERR) &&
-		(drvdata->trc_error == true))
+	if ((config->mode & ETM_MODE_TRACE_ERR) && (caps->trc_error))
 		config->vinst_ctrl |= TRCVICTLR_TRCERR;
 	else
 		config->vinst_ctrl &= ~TRCVICTLR_TRCERR;
@@ -463,13 +469,14 @@ static ssize_t pe_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
-	if (val > drvdata->nr_pe) {
+	if (val > caps->nr_pe) {
 		raw_spin_unlock(&drvdata->spinlock);
 		return -EINVAL;
 	}
@@ -498,13 +505,14 @@ static ssize_t event_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
-	switch (drvdata->nr_event) {
+	switch (caps->nr_event) {
 	case 0x0:
 		/* EVENT0, bits[7:0] */
 		config->eventctrl0 = val & 0xFF;
@@ -547,6 +555,7 @@ static ssize_t event_instren_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -555,7 +564,7 @@ static ssize_t event_instren_store(struct device *dev,
 	raw_spin_lock(&drvdata->spinlock);
 	/* start by clearing all instruction event enable bits */
 	config->eventctrl1 &= ~TRCEVENTCTL1R_INSTEN_MASK;
-	switch (drvdata->nr_event) {
+	switch (caps->nr_event) {
 	case 0x0:
 		/* generate Event element for event 1 */
 		config->eventctrl1 |= val & TRCEVENTCTL1R_INSTEN_1;
@@ -603,11 +612,12 @@ static ssize_t event_ts_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (!drvdata->ts_size)
+	if (!caps->ts_size)
 		return -EINVAL;
 
 	config->ts_ctrl = val & ETMv4_EVENT_MASK;
@@ -633,11 +643,12 @@ static ssize_t syncfreq_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (drvdata->syncpr == true)
+	if (caps->syncpr)
 		return -EINVAL;
 
 	config->syncfreq = val & ETMv4_SYNC_MASK;
@@ -663,6 +674,7 @@ static ssize_t cyc_threshold_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -670,7 +682,7 @@ static ssize_t cyc_threshold_store(struct device *dev,
 
 	/* mask off max threshold before checking min value */
 	val &= ETM_CYC_THRESHOLD_MASK;
-	if (val < drvdata->ccitmin)
+	if (val < caps->ccitmin)
 		return -EINVAL;
 
 	config->ccctlr = val;
@@ -696,13 +708,14 @@ static ssize_t bb_ctrl_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (drvdata->trcbb == false)
+	if (!caps->trcbb)
 		return -EINVAL;
-	if (!drvdata->nr_addr_cmp)
+	if (!caps->nr_addr_cmp)
 		return -EINVAL;
 
 	/*
@@ -768,6 +781,7 @@ static ssize_t s_exlevel_vinst_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -777,7 +791,7 @@ static ssize_t s_exlevel_vinst_store(struct device *dev,
 	/* clear all EXLEVEL_S bits  */
 	config->vinst_ctrl &= ~TRCVICTLR_EXLEVEL_S_MASK;
 	/* enable instruction tracing for corresponding exception level */
-	val &= drvdata->s_ex_level;
+	val &= caps->s_ex_level;
 	config->vinst_ctrl |= val << __bf_shf(TRCVICTLR_EXLEVEL_S_MASK);
 	raw_spin_unlock(&drvdata->spinlock);
 	return size;
@@ -803,6 +817,7 @@ static ssize_t ns_exlevel_vinst_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -812,7 +827,7 @@ static ssize_t ns_exlevel_vinst_store(struct device *dev,
 	/* clear EXLEVEL_NS bits  */
 	config->vinst_ctrl &= ~TRCVICTLR_EXLEVEL_NS_MASK;
 	/* enable instruction tracing for corresponding exception level */
-	val &= drvdata->ns_ex_level;
+	val &= caps->ns_ex_level;
 	config->vinst_ctrl |= val << __bf_shf(TRCVICTLR_EXLEVEL_NS_MASK);
 	raw_spin_unlock(&drvdata->spinlock);
 	return size;
@@ -837,11 +852,12 @@ static ssize_t addr_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->nr_addr_cmp * 2)
+	if (val >= caps->nr_addr_cmp * 2)
 		return -EINVAL;
 
 	/*
@@ -1060,6 +1076,7 @@ static ssize_t addr_start_store(struct device *dev,
 	u8 idx;
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -1067,7 +1084,7 @@ static ssize_t addr_start_store(struct device *dev,
 
 	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
-	if (!drvdata->nr_addr_cmp) {
+	if (!caps->nr_addr_cmp) {
 		raw_spin_unlock(&drvdata->spinlock);
 		return -EINVAL;
 	}
@@ -1115,6 +1132,7 @@ static ssize_t addr_stop_store(struct device *dev,
 	u8 idx;
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -1122,7 +1140,7 @@ static ssize_t addr_stop_store(struct device *dev,
 
 	raw_spin_lock(&drvdata->spinlock);
 	idx = config->addr_idx;
-	if (!drvdata->nr_addr_cmp) {
+	if (!caps->nr_addr_cmp) {
 		raw_spin_unlock(&drvdata->spinlock);
 		return -EINVAL;
 	}
@@ -1167,6 +1185,7 @@ static ssize_t addr_ctxtype_store(struct device *dev,
 	u8 idx;
 	char str[10] = "";
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (strlen(buf) >= 10)
@@ -1181,13 +1200,13 @@ static ssize_t addr_ctxtype_store(struct device *dev,
 		config->addr_acc[idx] &= ~TRCACATRn_CONTEXTTYPE_MASK;
 	else if (!strcmp(str, "ctxid")) {
 		/* 0b01 The trace unit performs a Context ID */
-		if (drvdata->numcidc) {
+		if (caps->numcidc) {
 			config->addr_acc[idx] |= TRCACATRn_CONTEXTTYPE_CTXID;
 			config->addr_acc[idx] &= ~TRCACATRn_CONTEXTTYPE_VMID;
 		}
 	} else if (!strcmp(str, "vmid")) {
 		/* 0b10 The trace unit performs a VMID */
-		if (drvdata->numvmidc) {
+		if (caps->numvmidc) {
 			config->addr_acc[idx] &= ~TRCACATRn_CONTEXTTYPE_CTXID;
 			config->addr_acc[idx] |= TRCACATRn_CONTEXTTYPE_VMID;
 		}
@@ -1196,9 +1215,9 @@ static ssize_t addr_ctxtype_store(struct device *dev,
 		 * 0b11 The trace unit performs a Context ID
 		 * comparison and a VMID
 		 */
-		if (drvdata->numcidc)
+		if (caps->numcidc)
 			config->addr_acc[idx] |= TRCACATRn_CONTEXTTYPE_CTXID;
-		if (drvdata->numvmidc)
+		if (caps->numvmidc)
 			config->addr_acc[idx] |= TRCACATRn_CONTEXTTYPE_VMID;
 	}
 	raw_spin_unlock(&drvdata->spinlock);
@@ -1230,14 +1249,15 @@ static ssize_t addr_context_store(struct device *dev,
 	u8 idx;
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if ((drvdata->numcidc <= 1) && (drvdata->numvmidc <= 1))
+	if ((caps->numcidc <= 1) && (caps->numvmidc <= 1))
 		return -EINVAL;
-	if (val >=  (drvdata->numcidc >= drvdata->numvmidc ?
-		     drvdata->numcidc : drvdata->numvmidc))
+	if (val >=  (caps->numcidc >= caps->numvmidc ?
+		     caps->numcidc : caps->numvmidc))
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
@@ -1348,9 +1368,10 @@ static ssize_t vinst_pe_cmp_start_stop_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
-	if (!drvdata->nr_pe_cmp)
+	if (!caps->nr_pe_cmp)
 		return -EINVAL;
 	val = config->vipcssctlr;
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
@@ -1361,11 +1382,12 @@ static ssize_t vinst_pe_cmp_start_stop_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (!drvdata->nr_pe_cmp)
+	if (!caps->nr_pe_cmp)
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
@@ -1393,13 +1415,14 @@ static ssize_t seq_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
-	if (!drvdata->nrseqstate)
+	if (!caps->nrseqstate)
 		return -EINVAL;
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->nrseqstate - 1)
+	if (val >= caps->nrseqstate - 1)
 		return -EINVAL;
 
 	/*
@@ -1431,11 +1454,12 @@ static ssize_t seq_state_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->nrseqstate)
+	if (val >= caps->nrseqstate)
 		return -EINVAL;
 
 	config->seq_state = val;
@@ -1498,11 +1522,12 @@ static ssize_t seq_reset_event_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (!(drvdata->nrseqstate))
+	if (!(caps->nrseqstate))
 		return -EINVAL;
 
 	config->seq_rst = val & ETMv4_EVENT_MASK;
@@ -1528,11 +1553,12 @@ static ssize_t cntr_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->nr_cntr)
+	if (val >= caps->nr_cntr)
 		return -EINVAL;
 
 	/*
@@ -1676,6 +1702,7 @@ static ssize_t res_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
@@ -1684,7 +1711,7 @@ static ssize_t res_idx_store(struct device *dev,
 	 * Resource selector pair 0 is always implemented and reserved,
 	 * namely an idx with 0 and 1 is illegal.
 	 */
-	if ((val < 2) || (val >= 2 * drvdata->nr_resource))
+	if ((val < 2) || (val >= 2 * caps->nr_resource))
 		return -EINVAL;
 
 	/*
@@ -1758,11 +1785,12 @@ static ssize_t sshot_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->nr_ss_cmp)
+	if (val >= caps->nr_ss_cmp)
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
@@ -1876,11 +1904,12 @@ static ssize_t ctxid_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->numcidc)
+	if (val >= caps->numcidc)
 		return -EINVAL;
 
 	/*
@@ -1924,6 +1953,7 @@ static ssize_t ctxid_pid_store(struct device *dev,
 	u8 idx;
 	unsigned long pid;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	/*
@@ -1943,7 +1973,7 @@ static ssize_t ctxid_pid_store(struct device *dev,
 	 * ctxid comparator is implemented and ctxid is greater than 0 bits
 	 * in length
 	 */
-	if (!drvdata->ctxid_size || !drvdata->numcidc)
+	if (!caps->ctxid_size || !caps->numcidc)
 		return -EINVAL;
 	if (kstrtoul(buf, 16, &pid))
 		return -EINVAL;
@@ -1985,6 +2015,7 @@ static ssize_t ctxid_masks_store(struct device *dev,
 	u8 i, j, maskbyte;
 	unsigned long val1, val2, mask;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 	int nr_inputs;
 
@@ -2000,11 +2031,11 @@ static ssize_t ctxid_masks_store(struct device *dev,
 	 * ctxid comparator is implemented and ctxid is greater than 0 bits
 	 * in length
 	 */
-	if (!drvdata->ctxid_size || !drvdata->numcidc)
+	if (!caps->ctxid_size || !caps->numcidc)
 		return -EINVAL;
 	/* one mask if <= 4 comparators, two for up to 8 */
 	nr_inputs = sscanf(buf, "%lx %lx", &val1, &val2);
-	if ((drvdata->numcidc > 4) && (nr_inputs != 2))
+	if ((caps->numcidc > 4) && (nr_inputs != 2))
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
@@ -2012,7 +2043,7 @@ static ssize_t ctxid_masks_store(struct device *dev,
 	 * each byte[0..3] controls mask value applied to ctxid
 	 * comparator[0..3]
 	 */
-	switch (drvdata->numcidc) {
+	switch (caps->numcidc) {
 	case 0x1:
 		/* COMP0, bits[7:0] */
 		config->ctxid_mask0 = val1 & 0xFF;
@@ -2059,7 +2090,7 @@ static ssize_t ctxid_masks_store(struct device *dev,
 	 * of ctxid comparator0 value (corresponding to byte 0) register.
 	 */
 	mask = config->ctxid_mask0;
-	for (i = 0; i < drvdata->numcidc; i++) {
+	for (i = 0; i < caps->numcidc; i++) {
 		/* mask value of corresponding ctxid comparator */
 		maskbyte = mask & ETMv4_EVENT_MASK;
 		/*
@@ -2102,11 +2133,12 @@ static ssize_t vmid_idx_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
-	if (val >= drvdata->numvmidc)
+	if (val >= caps->numvmidc)
 		return -EINVAL;
 
 	/*
@@ -2147,6 +2179,7 @@ static ssize_t vmid_val_store(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	/*
@@ -2160,7 +2193,7 @@ static ssize_t vmid_val_store(struct device *dev,
 	 * only implemented when vmid tracing is enabled, i.e. at least one
 	 * vmid comparator is implemented and at least 8 bit vmid size
 	 */
-	if (!drvdata->vmid_size || !drvdata->numvmidc)
+	if (!caps->vmid_size || !caps->numvmidc)
 		return -EINVAL;
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
@@ -2200,6 +2233,7 @@ static ssize_t vmid_masks_store(struct device *dev,
 	u8 i, j, maskbyte;
 	unsigned long val1, val2, mask;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 	int nr_inputs;
 
@@ -2214,11 +2248,11 @@ static ssize_t vmid_masks_store(struct device *dev,
 	 * only implemented when vmid tracing is enabled, i.e. at least one
 	 * vmid comparator is implemented and at least 8 bit vmid size
 	 */
-	if (!drvdata->vmid_size || !drvdata->numvmidc)
+	if (!caps->vmid_size || !caps->numvmidc)
 		return -EINVAL;
 	/* one mask if <= 4 comparators, two for up to 8 */
 	nr_inputs = sscanf(buf, "%lx %lx", &val1, &val2);
-	if ((drvdata->numvmidc > 4) && (nr_inputs != 2))
+	if ((caps->numvmidc > 4) && (nr_inputs != 2))
 		return -EINVAL;
 
 	raw_spin_lock(&drvdata->spinlock);
@@ -2227,7 +2261,7 @@ static ssize_t vmid_masks_store(struct device *dev,
 	 * each byte[0..3] controls mask value applied to vmid
 	 * comparator[0..3]
 	 */
-	switch (drvdata->numvmidc) {
+	switch (caps->numvmidc) {
 	case 0x1:
 		/* COMP0, bits[7:0] */
 		config->vmid_mask0 = val1 & 0xFF;
@@ -2275,7 +2309,7 @@ static ssize_t vmid_masks_store(struct device *dev,
 	 * of vmid comparator0 value (corresponding to byte 0) register.
 	 */
 	mask = config->vmid_mask0;
-	for (i = 0; i < drvdata->numvmidc; i++) {
+	for (i = 0; i < caps->numvmidc; i++) {
 		/* mask value of corresponding vmid comparator */
 		maskbyte = mask & ETMv4_EVENT_MASK;
 		/*
diff --git a/drivers/hwtracing/coresight/coresight-etm4x.h b/drivers/hwtracing/coresight/coresight-etm4x.h
index 89d81ce4e04e..8168676f2945 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x.h
+++ b/drivers/hwtracing/coresight/coresight-etm4x.h
@@ -812,6 +812,95 @@ enum etm_impdef_type {
 	ETM4_IMPDEF_FEATURE_MAX,
 };
 
+/**
+ * struct etmv4_caps - specifics ETM capabilities
+ * @nr_pe:	The number of processing entity available for tracing.
+ * @nr_pe_cmp:	The number of processing entity comparator inputs that are
+ *		available for tracing.
+ * @nr_addr_cmp:Number of pairs of address comparators available
+ *		as found in ETMIDR4 0-3.
+ * @nr_cntr:    Number of counters as found in ETMIDR5 bit 28-30.
+ * @nr_ext_inp: Number of external input.
+ * @numcidc:	Number of contextID comparators.
+ * @numextinsel: Number of external input selector resources.
+ * @numvmidc:	Number of VMID comparators.
+ * @nrseqstate: The number of sequencer states that are implemented.
+ * @nr_event:	Indicates how many events the trace unit support.
+ * @nr_resource:The number of resource selection pairs available for tracing.
+ * @nr_ss_cmp:	Number of single-shot comparator controls that are available.
+ * @trcid_size: Indicates the trace ID width.
+ * @ts_size:	Global timestamp size field.
+ * @ctxid_size:	Size of the context ID field to consider.
+ * @vmid_size:	Size of the VM ID comparator to consider.
+ * @ccsize:	Indicates the size of the cycle counter in bits.
+ * @ccitmin:	minimum value that can be programmed in
+ * @s_ex_level:	In secure state, indicates whether instruction tracing is
+ *		supported for the corresponding Exception level.
+ * @ns_ex_level:In non-secure state, indicates whether instruction tracing is
+ *		supported for the corresponding Exception level.
+ * @q_support:	Q element support characteristics.
+ * @os_lock_model: OSLock model.
+ * @instrp0:	Tracing of load and store instructions
+ *		as P0 elements is supported.
+ * @q_filt:	Q element filtering support, if Q elements are supported.
+ * @trcbb:	Indicates if the trace unit supports branch broadcast tracing.
+ * @trccond:	If the trace unit supports conditional
+ *		instruction tracing.
+ * @retstack:	Indicates if the implementation supports a return stack.
+ * @trccci:	Indicates if the trace unit supports cycle counting
+ *		for instruction.
+ * @trc_error:	Whether a trace unit can trace a system
+ *		error exception.
+ * @syncpr:	Indicates if an implementation has a fixed
+ *		synchronization period.
+ * @stallctl:	If functionality that prevents trace unit buffer overflows
+ *		is available.
+ * @sysstall:	Does the system support stall control of the PE?
+ * @nooverflow:	Indicate if overflow prevention is supported.
+ * @atbtrig:	If the implementation can support ATB triggers
+ * @lpoverride:	If the implementation can support low-power state over.
+ * @skip_power_up: Indicates if an implementation can skip powering up
+ *		   the trace unit.
+ */
+struct etmv4_caps {
+	u8	nr_pe;
+	u8	nr_pe_cmp;
+	u8	nr_addr_cmp;
+	u8	nr_cntr;
+	u8	nr_ext_inp;
+	u8	numcidc;
+	u8	numextinsel;
+	u8	numvmidc;
+	u8	nrseqstate;
+	u8	nr_event;
+	u8	nr_resource;
+	u8	nr_ss_cmp;
+	u8	trcid_size;
+	u8	ts_size;
+	u8	ctxid_size;
+	u8	vmid_size;
+	u8	ccsize;
+	u16	ccitmin;
+	u8	s_ex_level;
+	u8	ns_ex_level;
+	u8	q_support;
+	u8	os_lock_model;
+	bool	instrp0 : 1;
+	bool	q_filt : 1;
+	bool	trcbb : 1;
+	bool	trccond : 1;
+	bool	retstack : 1;
+	bool	trccci : 1;
+	bool	trc_error : 1;
+	bool	syncpr : 1;
+	bool	stallctl : 1;
+	bool	sysstall : 1;
+	bool	nooverflow : 1;
+	bool	atbtrig : 1;
+	bool	lpoverride : 1;
+	bool	skip_power_up : 1;
+};
+
 /**
  * struct etmv4_config - configuration information related to an ETMv4
  * @mode:	Controls various modes supported by this ETM.
@@ -819,8 +908,8 @@ enum etm_impdef_type {
  * @cfg:	Controls the tracing options.
  * @eventctrl0: Controls the tracing of arbitrary events.
  * @eventctrl1: Controls the behavior of the events that @event_ctrl0 selects.
- * @stallctl:	If functionality that prevents trace unit buffer overflows
- *		is available.
+ * @stall_ctrl:	Enables trace unit functionality that prevents trace
+ *		unit buffer overflows.
  * @ts_ctrl:	Controls the insertion of global timestamps in the
  *		trace streams.
  * @syncfreq:	Controls how often trace synchronization requests occur.
@@ -971,61 +1060,17 @@ struct etmv4_save_state {
  * @mode:	This tracer's mode, i.e sysFS, Perf or disabled.
  * @cpu:        The cpu this component is affined to.
  * @arch:       ETM architecture version.
- * @nr_pe:	The number of processing entity available for tracing.
- * @nr_pe_cmp:	The number of processing entity comparator inputs that are
- *		available for tracing.
- * @nr_addr_cmp:Number of pairs of address comparators available
- *		as found in ETMIDR4 0-3.
- * @nr_cntr:    Number of counters as found in ETMIDR5 bit 28-30.
- * @nr_ext_inp: Number of external input.
- * @numcidc:	Number of contextID comparators.
- * @numvmidc:	Number of VMID comparators.
- * @nrseqstate: The number of sequencer states that are implemented.
- * @nr_event:	Indicates how many events the trace unit support.
- * @nr_resource:The number of resource selection pairs available for tracing.
- * @nr_ss_cmp:	Number of single-shot comparator controls that are available.
+ * @caps:	ETM capabilities.
  * @trcid:	value of the current ID for this component.
- * @trcid_size: Indicates the trace ID width.
- * @ts_size:	Global timestamp size field.
- * @ctxid_size:	Size of the context ID field to consider.
- * @vmid_size:	Size of the VM ID comparator to consider.
- * @ccsize:	Indicates the size of the cycle counter in bits.
- * @ccitmin:	minimum value that can be programmed in
- * @s_ex_level:	In secure state, indicates whether instruction tracing is
- *		supported for the corresponding Exception level.
- * @ns_ex_level:In non-secure state, indicates whether instruction tracing is
- *		supported for the corresponding Exception level.
  * @sticky_enable: true if ETM base configuration has been done.
  * @boot_enable:True if we should start tracing at boot time.
  * @os_unlock:  True if access to management registers is allowed.
- * @instrp0:	Tracing of load and store instructions
- *		as P0 elements is supported.
- * @q_filt:	Q element filtering support, if Q elements are supported.
- * @trcbb:	Indicates if the trace unit supports branch broadcast tracing.
- * @trccond:	If the trace unit supports conditional
- *		instruction tracing.
- * @retstack:	Indicates if the implementation supports a return stack.
- * @trccci:	Indicates if the trace unit supports cycle counting
- *		for instruction.
- * @q_support:	Q element support characteristics.
- * @trc_error:	Whether a trace unit can trace a system
- *		error exception.
- * @syncpr:	Indicates if an implementation has a fixed
- *		synchronization period.
- * @stall_ctrl:	Enables trace unit functionality that prevents trace
- *		unit buffer overflows.
- * @sysstall:	Does the system support stall control of the PE?
- * @nooverflow:	Indicate if overflow prevention is supported.
- * @atbtrig:	If the implementation can support ATB triggers
- * @lpoverride:	If the implementation can support low-power state over.
  * @trfcr:	If the CPU supports FEAT_TRF, value of the TRFCR_ELx that
  *		allows tracing at all ELs. We don't want to compute this
  *		at runtime, due to the additional setting of TRFCR_CX when
  *		in EL2. Otherwise, 0.
  * @config:	structure holding configuration parameters.
  * @save_state:	State to be preserved across power loss
- * @skip_power_up: Indicates if an implementation can skip powering up
- *		   the trace unit.
  * @paused:	Indicates if the trace unit is paused.
  * @arch_features: Bitmap of arch features of etmv4 devices.
  */
@@ -1037,46 +1082,11 @@ struct etmv4_drvdata {
 	raw_spinlock_t			spinlock;
 	int				cpu;
 	u8				arch;
-	u8				nr_pe;
-	u8				nr_pe_cmp;
-	u8				nr_addr_cmp;
-	u8				nr_cntr;
-	u8				nr_ext_inp;
-	u8				numcidc;
-	u8				numextinsel;
-	u8				numvmidc;
-	u8				nrseqstate;
-	u8				nr_event;
-	u8				nr_resource;
-	u8				nr_ss_cmp;
+	struct etmv4_caps		caps;
 	u8				trcid;
-	u8				trcid_size;
-	u8				ts_size;
-	u8				ctxid_size;
-	u8				vmid_size;
-	u8				ccsize;
-	u16				ccitmin;
-	u8				s_ex_level;
-	u8				ns_ex_level;
-	u8				q_support;
-	u8				os_lock_model;
 	bool				sticky_enable : 1;
 	bool				boot_enable : 1;
 	bool				os_unlock : 1;
-	bool				instrp0 : 1;
-	bool				q_filt : 1;
-	bool				trcbb : 1;
-	bool				trccond : 1;
-	bool				retstack : 1;
-	bool				trccci : 1;
-	bool				trc_error : 1;
-	bool				syncpr : 1;
-	bool				stallctl : 1;
-	bool				sysstall : 1;
-	bool				nooverflow : 1;
-	bool				atbtrig : 1;
-	bool				lpoverride : 1;
-	bool				skip_power_up : 1;
 	bool				paused : 1;
 	u64				trfcr;
 	struct etmv4_config		config;
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 04/12] coresight: etm4x: exclude ss_status from drvdata->config
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

The purpose of TRCSSCSRn register is to show status of
the corresponding Single-shot Comparator Control and input supports.
That means writable field's purpose for reset or restore from idle status
not for configuration.

Therefore, exclude ss_status from drvdata->config, move it to etm4x_caps
and rename it to ss_smp.

This includes remove TRCSSCRn from configurable item and
remove saving in etm4_disable_hw().

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 .../hwtracing/coresight/coresight-etm4x-cfg.c |  1 -
 .../coresight/coresight-etm4x-core.c          | 19 ++++++-------------
 .../coresight/coresight-etm4x-sysfs.c         |  7 ++-----
 drivers/hwtracing/coresight/coresight-etm4x.h |  7 ++++++-
 4 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-cfg.c b/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
index c302072b293a..d14d7c8a23e5 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-cfg.c
@@ -86,7 +86,6 @@ static int etm4_cfg_map_reg_offset(struct etmv4_drvdata *drvdata,
 		off_mask =  (offset & GENMASK(11, 5));
 		do {
 			CHECKREGIDX(TRCSSCCRn(0), ss_ctrl, idx, off_mask);
-			CHECKREGIDX(TRCSSCSRn(0), ss_status, idx, off_mask);
 			CHECKREGIDX(TRCSSPCICRn(0), ss_pe_cmp, idx, off_mask);
 		} while (0);
 	} else if ((offset >= TRCCIDCVRn(0)) && (offset <= TRCVMIDCVRn(7))) {
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index b2b092a76eb5..f55338a4989d 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -91,7 +91,7 @@ static bool etm4x_sspcicrn_present(struct etmv4_drvdata *drvdata, int n)
 	const struct etmv4_caps *caps = &drvdata->caps;
 
 	return (n < caps->nr_ss_cmp) && caps->nr_pe_cmp &&
-	       (drvdata->config.ss_status[n] & TRCSSCSRn_PC);
+	       (caps->ss_cmp[n] & TRCSSCSRn_PC);
 }
 
 u64 etm4x_sysreg_read(u32 offset, bool _relaxed, bool _64bit)
@@ -573,11 +573,9 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 		etm4x_relaxed_write32(csa, config->res_ctrl[i], TRCRSCTLRn(i));
 
 	for (i = 0; i < caps->nr_ss_cmp; i++) {
-		/* always clear status bit on restart if using single-shot */
-		if (config->ss_ctrl[i] || config->ss_pe_cmp[i])
-			config->ss_status[i] &= ~TRCSSCSRn_STATUS;
 		etm4x_relaxed_write32(csa, config->ss_ctrl[i], TRCSSCCRn(i));
-		etm4x_relaxed_write32(csa, config->ss_status[i], TRCSSCSRn(i));
+		/* always clear status and pending bits on restart if using single-shot */
+		etm4x_relaxed_write32(csa, 0x0, TRCSSCSRn(i));
 		if (etm4x_sspcicrn_present(drvdata, i))
 			etm4x_relaxed_write32(csa, config->ss_pe_cmp[i], TRCSSPCICRn(i));
 	}
@@ -1055,12 +1053,6 @@ static void etm4_disable_hw(struct etmv4_drvdata *drvdata)
 
 	etm4_disable_trace_unit(drvdata);
 
-	/* read the status of the single shot comparators */
-	for (i = 0; i < caps->nr_ss_cmp; i++) {
-		config->ss_status[i] =
-			etm4x_relaxed_read32(csa, TRCSSCSRn(i));
-	}
-
 	/* read back the current counter values */
 	for (i = 0; i < caps->nr_cntr; i++) {
 		config->cntr_val[i] =
@@ -1503,8 +1495,9 @@ static void etm4_init_arch_data(void *info)
 	 */
 	caps->nr_ss_cmp = FIELD_GET(TRCIDR4_NUMSSCC_MASK, etmidr4);
 	for (i = 0; i < caps->nr_ss_cmp; i++) {
-		drvdata->config.ss_status[i] =
-			etm4x_relaxed_read32(csa, TRCSSCSRn(i));
+		caps->ss_cmp[i] = etm4x_relaxed_read32(csa, TRCSSCSRn(i));
+		caps->ss_cmp[i] &= (TRCSSCSRn_PC | TRCSSCSRn_DV |
+				    TRCSSCSRn_DA | TRCSSCSRn_INST);
 	}
 	/* NUMCIDC, bits[27:24] number of Context ID comparators for tracing */
 	caps->numcidc = FIELD_GET(TRCIDR4_NUMCIDC_MASK, etmidr4);
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
index 8bd28e71d4c9..5e26c2ec8f7b 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
@@ -1829,8 +1829,6 @@ static ssize_t sshot_ctrl_store(struct device *dev,
 	raw_spin_lock(&drvdata->spinlock);
 	idx = config->ss_idx;
 	config->ss_ctrl[idx] = FIELD_PREP(TRCSSCCRn_SAC_ARC_RST_MASK, val);
-	/* must clear bit 31 in related status register on programming */
-	config->ss_status[idx] &= ~TRCSSCSRn_STATUS;
 	raw_spin_unlock(&drvdata->spinlock);
 	return size;
 }
@@ -1841,10 +1839,11 @@ static ssize_t sshot_status_show(struct device *dev,
 {
 	unsigned long val;
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
+	const struct etmv4_caps *caps = &drvdata->caps;
 	struct etmv4_config *config = &drvdata->config;
 
 	raw_spin_lock(&drvdata->spinlock);
-	val = config->ss_status[config->ss_idx];
+	val = caps->ss_cmp[config->ss_idx];
 	raw_spin_unlock(&drvdata->spinlock);
 	return scnprintf(buf, PAGE_SIZE, "%#lx\n", val);
 }
@@ -1879,8 +1878,6 @@ static ssize_t sshot_pe_ctrl_store(struct device *dev,
 	raw_spin_lock(&drvdata->spinlock);
 	idx = config->ss_idx;
 	config->ss_pe_cmp[idx] = FIELD_PREP(TRCSSPCICRn_PC_MASK, val);
-	/* must clear bit 31 in related status register on programming */
-	config->ss_status[idx] &= ~TRCSSCSRn_STATUS;
 	raw_spin_unlock(&drvdata->spinlock);
 	return size;
 }
diff --git a/drivers/hwtracing/coresight/coresight-etm4x.h b/drivers/hwtracing/coresight/coresight-etm4x.h
index 8168676f2945..db56c4414873 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x.h
+++ b/drivers/hwtracing/coresight/coresight-etm4x.h
@@ -213,6 +213,7 @@
 #define TRCACATRn_EXLEVEL_MASK			GENMASK(14, 8)
 
 #define TRCSSCSRn_STATUS			BIT(31)
+#define TRCSSCSRn_PENDING			BIT(30)
 #define TRCSSCCRn_SAC_ARC_RST_MASK		GENMASK(24, 0)
 
 #define TRCSSPCICRn_PC_MASK			GENMASK(7, 0)
@@ -729,6 +730,9 @@ static inline u32 etm4_res_sel_pair(u8 res_sel_idx)
 #define ETM_DEFAULT_ADDR_COMP		0
 
 #define TRCSSCSRn_PC			BIT(3)
+#define TRCSSCSRn_DV			BIT(2)
+#define TRCSSCSRn_DA			BIT(1)
+#define TRCSSCSRn_INST			BIT(0)
 
 /* PowerDown Control Register bits */
 #define TRCPDCR_PU			BIT(3)
@@ -861,6 +865,7 @@ enum etm_impdef_type {
  * @lpoverride:	If the implementation can support low-power state over.
  * @skip_power_up: Indicates if an implementation can skip powering up
  *		   the trace unit.
+ * @ss_cmp:	Indicates supported single-shot comparators.
  */
 struct etmv4_caps {
 	u8	nr_pe;
@@ -899,6 +904,7 @@ struct etmv4_caps {
 	bool	atbtrig : 1;
 	bool	lpoverride : 1;
 	bool	skip_power_up : 1;
+	u32	ss_cmp[ETM_MAX_SS_CMP];
 };
 
 /**
@@ -977,7 +983,6 @@ struct etmv4_config {
 	u32				res_ctrl[ETM_MAX_RES_SEL]; /* TRCRSCTLRn */
 	u8				ss_idx;
 	u32				ss_ctrl[ETM_MAX_SS_CMP];
-	u32				ss_status[ETM_MAX_SS_CMP];
 	u32				ss_pe_cmp[ETM_MAX_SS_CMP];
 	u8				addr_idx;
 	u64				addr_val[ETM_MAX_SINGLE_ADDR_CMP];
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 05/12] coresight: etm4x: remove redundant fields in etmv4_save_state
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

Some of fields are redundant in etmv4_save_state and never used:

    ss_status => trcsscsr
    seq_state => trcseqstr
    cntr_val => trccntvr
    vinst_ctrl => trcvictlr

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm4x.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x.h b/drivers/hwtracing/coresight/coresight-etm4x.h
index db56c4414873..cbd8890d166a 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x.h
+++ b/drivers/hwtracing/coresight/coresight-etm4x.h
@@ -1047,11 +1047,6 @@ struct etmv4_save_state {
 
 	u32	trcclaimset;
 
-	u32	cntr_val[ETMv4_MAX_CNTR];
-	u32	seq_state;
-	u32	vinst_ctrl;
-	u32	ss_status[ETM_MAX_SS_CMP];
-
 	u32	trcpdcr;
 };
 
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 02/12] coresight: etm4x: fix underflow for nrseqstate
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

TCRSEQEVR<n> is implemented only when TCRIDR5.NUMSEQSTATE is 0b100,
in which case n ranges from 0 to 2; otherwise, TCRIDR5.NUMSEQSTATE is 0b000.

Therefore, drvdata->nrseqstate should be checked before entering the loop.

Link: https://developer.arm.com/documentation/ihi0064/latest/ [0]
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 .../hwtracing/coresight/coresight-etm4x-core.c | 18 ++++++++++--------
 .../coresight/coresight-etm4x-sysfs.c          |  2 ++
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index 74b7063d130e..ba5b8b423bd4 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -544,9 +544,11 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
 	etm4x_relaxed_write32(csa, config->vissctlr, TRCVISSCTLR);
 	if (drvdata->nr_pe_cmp)
 		etm4x_relaxed_write32(csa, config->vipcssctlr, TRCVIPCSSCTLR);
-	for (i = 0; i < drvdata->nrseqstate - 1; i++)
-		etm4x_relaxed_write32(csa, config->seq_ctrl[i], TRCSEQEVRn(i));
+
 	if (drvdata->nrseqstate) {
+		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+			etm4x_relaxed_write32(csa, config->seq_ctrl[i], TRCSEQEVRn(i));
+
 		etm4x_relaxed_write32(csa, config->seq_rst, TRCSEQRSTEVR);
 		etm4x_relaxed_write32(csa, config->seq_state, TRCSEQSTR);
 	}
@@ -1927,10 +1929,10 @@ static int __etm4_cpu_save(struct etmv4_drvdata *drvdata)
 	if (drvdata->nr_pe_cmp)
 		state->trcvipcssctlr = etm4x_read32(csa, TRCVIPCSSCTLR);
 
-	for (i = 0; i < drvdata->nrseqstate - 1; i++)
-		state->trcseqevr[i] = etm4x_read32(csa, TRCSEQEVRn(i));
-
 	if (drvdata->nrseqstate) {
+		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+			state->trcseqevr[i] = etm4x_read32(csa, TRCSEQEVRn(i));
+
 		state->trcseqrstevr = etm4x_read32(csa, TRCSEQRSTEVR);
 		state->trcseqstr = etm4x_read32(csa, TRCSEQSTR);
 	}
@@ -2058,10 +2060,10 @@ static void __etm4_cpu_restore(struct etmv4_drvdata *drvdata)
 	if (drvdata->nr_pe_cmp)
 		etm4x_relaxed_write32(csa, state->trcvipcssctlr, TRCVIPCSSCTLR);
 
-	for (i = 0; i < drvdata->nrseqstate - 1; i++)
-		etm4x_relaxed_write32(csa, state->trcseqevr[i], TRCSEQEVRn(i));
-
 	if (drvdata->nrseqstate) {
+		for (i = 0; i < drvdata->nrseqstate - 1; i++)
+			etm4x_relaxed_write32(csa, state->trcseqevr[i], TRCSEQEVRn(i));
+
 		etm4x_relaxed_write32(csa, state->trcseqrstevr, TRCSEQRSTEVR);
 		etm4x_relaxed_write32(csa, state->trcseqstr, TRCSEQSTR);
 	}
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
index e9eeea6240d5..d866dcfa2a36 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-sysfs.c
@@ -1395,6 +1395,8 @@ static ssize_t seq_idx_store(struct device *dev,
 	struct etmv4_drvdata *drvdata = dev_get_drvdata(dev->parent);
 	struct etmv4_config *config = &drvdata->config;
 
+	if (!drvdata->nrseqstate)
+		return -EINVAL;
 	if (kstrtoul(buf, 16, &val))
 		return -EINVAL;
 	if (val >= drvdata->nrseqstate - 1)
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 01/12] coresight: etm4x: fix wrong check of etm4x_sspcicrn_present()
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun
In-Reply-To: <20260415165528.3369607-1-yeoreum.yun@arm.com>

According to Embedded Trace Macrocell Architecture Specification
ETMv4.0 to ETM4.6 [0], TRCSSPCICR<n> is present only if all of
the following are true:

  - TRCIDR4.NUMSSCC > n.
  - TRCIDR4.NUMPC > 0b0000.
  - TRCSSCSR<n>.PC == 0b1.

Comment for etm4x_sspcicrn_present() is align with the specification.
However, the check should use drvdata->nr_pe_cmp to check TRCIDR4.NUMPC
not nr_pe.

Link: https://developer.arm.com/documentation/ihi0064/latest/ [0]
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 drivers/hwtracing/coresight/coresight-etm4x-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index d565a73f0042..74b7063d130e 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -89,7 +89,7 @@ static int etm4_probe_cpu(unsigned int cpu);
 static bool etm4x_sspcicrn_present(struct etmv4_drvdata *drvdata, int n)
 {
 	return (n < drvdata->nr_ss_cmp) &&
-	       drvdata->nr_pe &&
+	       drvdata->nr_pe_cmp &&
 	       (drvdata->config.ss_status[n] & TRCSSCSRn_PC);
 }
 
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related

* [PATCH v5 00/12] fix several inconsistencies with sysfs configuration in etmX
From: Yeoreum Yun @ 2026-04-15 16:55 UTC (permalink / raw)
  To: coresight, linux-arm-kernel, linux-kernel
  Cc: suzuki.poulose, mike.leach, james.clark, alexander.shishkin,
	leo.yan, jie.gan, Yeoreum Yun

The current ETMx configuration via sysfs can lead to the following
inconsistencies:

  - If a configuration is modified via sysfs while a perf session is
    active, the running configuration may differ between before
    a sched-out and after a subsequent sched-in.

  - If a perf session and sysfs session tries to enable concurrently,
    configuration from configfs could be corrupted (etm4).

  - There is chance to corrupt drvdata->config if perf session tries
    to enabled among handling cscfg_csdev_disable_active_config()
    in etm4_disable_sysfs() (etm4).

To resolve these inconsistencies, the configuration should be separated into:

  - active_config, which is applied configuration for the current session
  - config, which stores the settings configured via sysfs.

and apply configuration from configfs after taking a mode.

Also, This patch set includes some small fixes:
  - missing trace id release in etm4x.
  - underflow issue for nrseqstate.
  - wrong check in etm4x_sspcicrn_present().

This patch based on v7.0

Patch History
=============
from v4 to v5:
  - add rb-tag.
  - fix underflow issue for nrseqstate.
  - fix wrong check in etm4_sspcicrn_present().
  - remove redundant fields on etmv4_save_state.
  - rename caps->ss_status to ss_cmp.
  - fix wrong location of etm4_release_trace_id.
  - https://lore.kernel.org/all/20260413142003.3549310-1-yeoreum.yun@arm.com/

from v3 to v4:
  - change etm_drvdata->spinlock type to raw_spin_lock_t
  - remove redundant call etmX_enable_hw() with starting_cpu() callsback.
  - fix missing trace id release.
  - add missing docs.
  - https://lore.kernel.org/all/20260412175506.412301-1-yeoreum.yun@arm.com/

from v2 to v3:
  - fix build error for etm3x.
  - fix checkpatch warning.
  - https://lore.kernel.org/all/20260410074310.2693385-1-yeoreum.yun@arm.com/

from v1 to v2
  - rebased to v7.0-rc7.
  - introduce etmX_caps structure to save etmX's capabilities.
  - remove ss_status from etmv4_config.
  - modify active_config after taking a mode (perf/sysfs).
  - https://lore.kernel.org/all/20260317181705.2456271-1-yeoreum.yun@arm.com/

Yeoreum Yun (12):
  coresight: etm4x: fix wrong check of etm4x_sspcicrn_present()
  coresight: etm4x: fix underflow for nrseqstate
  coresight: etm4x: introduce struct etm4_caps
  coresight: etm4x: exclude ss_status from drvdata->config
  coresight: etm4x: remove redundant fields in etmv4_save_state
  coresight: etm4x: fix leaked trace id
  coresight: etm4x: fix inconsistencies with sysfs configuration
  coresight: etm4x: remove redundant call etm4_enable_hw() with hotplug
  coresight: etm3x: change drvdata->spinlock type to raw_spin_lock_t
  coresight: etm3x: introduce struct etm_caps
  coresight: etm3x: fix inconsistencies with sysfs configuration
  coresight: etm3x: remove redundant call etm_enable_hw() with hotplug

 drivers/hwtracing/coresight/coresight-etm.h   |  46 ++-
 .../coresight/coresight-etm3x-core.c          | 100 ++---
 .../coresight/coresight-etm3x-sysfs.c         | 159 ++++----
 .../hwtracing/coresight/coresight-etm4x-cfg.c |   3 +-
 .../coresight/coresight-etm4x-core.c          | 374 ++++++++++--------
 .../coresight/coresight-etm4x-sysfs.c         | 199 ++++++----
 drivers/hwtracing/coresight/coresight-etm4x.h | 190 ++++-----
 7 files changed, 588 insertions(+), 483 deletions(-)


base-commit: 028ef9c96e96197026887c0f092424679298aae8
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply

* [RFC PATCH v3 2/7] hq-spinlock: implement inner logic
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

The design might be considered as a combination of cohort-locking
by Dave Dice and Linux kernel queued spinlock.

The contenders are organized into 2-level scheme where each NUMA-node
has it's own FIFO contender queue.

NUMA-queues are linked into single-linked list alike structure, while
maintaining FIFO order between them.
When there're no contenders left in a NUMA-queue, the queue is removed from
the list.

Contenders try to enqueue to local NUMA-queue first and if there's
no such queue - link it to the list.
As for "qspinlock" only the first contender is spinning globally,
all others - MCS-spinning.

"Handoff" operation becomes two-staged:
- local handoff: between contenders in a single NUMA-queue
- remote handoff: between different NUMA-queues.

If "remote handoff" reaches the end of the NUMA-queue linked list it goes to
the list head.

To avoid potential starvation issues, we use handoff thresholds.

Key challenge here was keeping the same "qspinlock" structure size
and "bind" a given lock to related NUMA-queues.

We came up with dynamic lock metadata concept, where we can
dynamically "bind" a given lock to it's NUMA-related metadata, and
then "unbind" when the lock is released.

This approach allows to avoid metadata reservation for every lock in advance,
thus giving the upperbound of metadata instance number to ~ (NR_CPUS x nr_contexts / 2).
Which corresponds to maximum amount of different locks falling into the slowpath.

HQ-lock supports switching between "default qspinlock" mode to "NUMA-aware lock" mode and backwards.

If for some reason "NUMA-aware" mode cannot be enabled we fallback to default qspinlock mode.

Functions that will be used in generic qspinlock code are started with "hqlock_"
(`hqlock_xchg_tail`, `hqlock_clear_pending`, `hqlock_clear_pending_set_locked`,
`hqlock_try_clear_tail`, `hqlock_handoff`).

Testing with locktorture module shows that this design reduces overhead
on both x86 and arm64 NUMA systems. On AMD EPYC 9654, throughput gains reach
up to 186% in the evaluated NPS=12 configuration. On Kunpeng 920,
throughput gains range from 93% to 158% across the tested thread
counts. Fairness on AMD EPYC 9654 remained in the 0.51-0.61 range in
the evaluated configurations.

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>

---
The full benchmark tables are included in the cover letter.
---
 kernel/locking/hqlock_core.h | 715 +++++++++++++++++++++++++++++++++++
 kernel/locking/hqlock_meta.h | 467 +++++++++++++++++++++++
 2 files changed, 1182 insertions(+)
 create mode 100644 kernel/locking/hqlock_core.h
 create mode 100644 kernel/locking/hqlock_meta.h

diff --git a/kernel/locking/hqlock_core.h b/kernel/locking/hqlock_core.h
new file mode 100644
index 0000000000..7322199228
--- /dev/null
+++ b/kernel/locking/hqlock_core.h
@@ -0,0 +1,715 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+#include <linux/nodemask.h>
+#include <linux/topology.h>
+#include <linux/sched/clock.h>
+#include <linux/moduleparam.h>
+#include <linux/sched/rt.h>
+#include <linux/random.h>
+#include <linux/mm.h>
+#include <linux/memblock.h>
+#include <linux/sysctl.h>
+#include <linux/types.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/panic.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/sprintf.h>
+#include <linux/proc_fs.h>
+#include <linux/uaccess.h>
+#include <linux/swab.h>
+#include <linux/hash.h>
+
+/* Contains queues for all possible lock id */
+static struct numa_queue *queue_table[MAX_NUMNODES];
+
+#include "hqlock_types.h"
+#include "hqlock_proc.h"
+
+/* Gets node_id (1..N) */
+static inline struct numa_queue *
+get_queue(u16 lock_id, u16 node_id)
+{
+	return &queue_table[node_id - 1][lock_id];
+}
+
+static inline struct numa_queue *
+get_local_queue(struct numa_qnode *qnode)
+{
+	return get_queue(qnode->lock_id, qnode->numa_node);
+}
+
+static inline void init_queue_link(struct numa_queue *queue)
+{
+	queue->prev_node = 0;
+	queue->next_node = 0;
+}
+
+static inline void init_queue(struct numa_qnode *qnode)
+{
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	queue->head = qnode;
+	queue->handoffs_not_head = 0;
+	init_queue_link(queue);
+}
+
+static void set_next_queue(u16 lock_id, u16 prev_node_id, u16 node_id)
+{
+	struct numa_queue *local_queue = get_queue(lock_id, node_id);
+	struct numa_queue *prev_queue =
+		get_queue(lock_id, prev_node_id);
+
+	WRITE_ONCE(local_queue->prev_node, prev_node_id);
+	/*
+	 * Needs to be guaranteed the following:
+	 * when appending "local_queue", if "prev_queue->next_node" link
+	 * is observed then "local_queue->prev_node" is also observed.
+	 *
+	 * We need this to guarantee correctness of concurrent
+	 * "unlink_node_queue" for the "prev_queue", if "prev_queue" is the first in the list.
+	 * [prev_queue] <-> [local_queue]
+	 *
+	 * In this case "unlink_node_queue" would be setting "local_queue->prev_node = 0", thus
+	 * w/o the smp-barrier, it might race with "set_next_queue", if
+	 * "local_queue->prev_node = prev_node_id" happens afterwards, leading to corrupted list.
+	 */
+	smp_wmb();
+	WRITE_ONCE(prev_queue->next_node, node_id);
+}
+
+static inline struct lock_metadata *get_meta(u16 lock_id);
+
+/**
+ * Put new node's queue into global NUMA-level queue
+ */
+static inline u16 append_node_queue(u16 lock_id, u16 node_id)
+{
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	u16 prev_node_id = xchg(&lock_meta->tail_node, node_id);
+
+	if (prev_node_id)
+		set_next_queue(lock_id, prev_node_id, node_id);
+	else
+		WRITE_ONCE(lock_meta->head_node, node_id);
+	return prev_node_id;
+}
+
+#include "hqlock_meta.h"
+
+/**
+ * Update tail
+ *
+ * Call proper function depending on lock's mode
+ * until successful queuing
+ */
+static inline u32 hqlock_xchg_tail(struct qspinlock *lock, u32 tail,
+				 struct mcs_spinlock *node, bool *numa_awareness_on)
+{
+	struct numa_qnode *qnode = (struct numa_qnode *)node;
+
+	u16 lock_id;
+	u32 old_tail;
+	u32 next_tail = tail;
+
+	/*
+	 * Key lock's mode switches questions:
+	 * - After init lock is in LOCK_MODE_QSPINLOCK
+	 * - If many contenders have come while lock was in LOCK_MODE_QSPINLOCK,
+	 *   we want this lock to use NUMA awareness next time,
+	 *   so we clean LOCK_MODE_QSPINLOCK, see 'low_contention_try_clear_tail'
+	 * - During next lock's usages we try to go through NUMA-aware path.
+	 *   We can fail here, because we use shared metadata
+	 *   and can have a conflict with another lock, see 'hqlock_meta.h' for details.
+	 *   In this case we fallback to generic qspinlock approach.
+	 *
+	 * In other words, lock can be in 3 mode states:
+	 *
+	 * 1. LOCK_MODE_QSPINLOCK - there was low contention or not at all earlier,
+	 *    or (unlikely) a conflict in metadata
+	 * 2. LOCK_NO_MODE - there was a contention on a lock earlier,
+	 *    now there are no contenders in the queue (we are likely the first)
+	 *    and we need to try using NUMA awareness
+	 * 3. LOCK_MODE_HQLOCK - lock is currently under contention
+	 *    and using NUMA awareness.
+	 */
+
+	/*
+	 * numa_awareness_on == false means we saw LOCK_MODE_QSPINLOCK (1st state)
+	 * before starting slowpath, see 'queued_spin_lock_slowpath'
+	 */
+	if (*numa_awareness_on == false &&
+		try_update_tail_qspinlock_mode(lock, tail, &old_tail, &next_tail))
+		return old_tail;
+
+	/* Calculate the lock_id hash here once */
+	qnode->lock_id = lock_id = hash_ptr(lock, LOCK_ID_BITS);
+
+try_again:
+	/*
+	 * Lock is in state 2 or 3 - go through NUMA-aware path
+	 */
+	if (try_update_tail_hqlock_mode(lock, lock_id, qnode, tail, &next_tail, &old_tail)) {
+		*numa_awareness_on = true;
+		return old_tail;
+	}
+
+	/*
+	 * We have failed (conflict in metadata), now lock is in LOCK_MODE_QSPINLOCK again
+	 */
+	if (try_update_tail_qspinlock_mode(lock, tail, &old_tail, &next_tail)) {
+		*numa_awareness_on = false;
+		return old_tail;
+	}
+
+	/*
+	 * We were slow and clear_tail after high contention has already happened
+	 * (very unlikely situation)
+	 */
+	goto try_again;
+}
+
+static inline void hqlock_clear_pending(struct qspinlock *lock, u32 old_val)
+{
+	WRITE_ONCE(lock->pending, (old_val & _Q_LOCK_TYPE_MODE_MASK) >> _Q_PENDING_OFFSET);
+}
+
+static inline void hqlock_clear_pending_set_locked(struct qspinlock *lock, u32 old_val)
+{
+	WRITE_ONCE(lock->locked_pending,
+			_Q_LOCKED_VAL | (old_val & _Q_LOCK_TYPE_MODE_MASK));
+}
+
+static inline void unlink_node_queue(u16 lock_id,
+					u16 prev_node_id,
+					u16 next_node_id)
+{
+	struct numa_queue *prev_queue =
+		prev_node_id ? get_queue(lock_id, prev_node_id) : NULL;
+	struct numa_queue *next_queue = get_queue(lock_id, next_node_id);
+
+	if (prev_queue)
+		WRITE_ONCE(prev_queue->next_node, next_node_id);
+	/*
+	 * This is guaranteed to be ordered "after" next_node_id observation
+	 * by implicit full-barrier in the caller-code.
+	 */
+	WRITE_ONCE(next_queue->prev_node, prev_node_id);
+}
+
+static inline bool try_clear_queue_tail(struct numa_queue *queue, u32 tail)
+{
+	/*
+	 * We need full ordering here to:
+	 * - ensure all prior operations with global tail and prev_queue
+	 *   are observed before clearing local tail
+	 * - guarantee all subsequent operations
+	 *   with metadata release, unlink etc will be observed after clearing local tail
+	 */
+	return cmpxchg(&queue->tail, tail, 0) == tail;
+}
+
+/*
+ * Determine if we have another local and global contenders.
+ * Try clear local and global tail, understand handoff type we need to perform.
+ * In case we are the last, free lock's metadata
+ */
+static inline bool hqlock_try_clear_tail(struct qspinlock *lock, u32 val,
+				       u32 tail, struct mcs_spinlock *node,
+				       int *p_next_node)
+{
+	bool ret = false;
+	struct numa_qnode *qnode = (void *)node;
+
+	u16 lock_id = qnode->lock_id;
+	u16 local_node = qnode->numa_node;
+	struct numa_queue *queue = get_queue(lock_id, qnode->numa_node);
+
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+
+	u16 prev_node = 0, next_node = 0;
+	u16 node_tail;
+
+	u32 old_val;
+
+	bool lock_tail_updated = false;
+	bool lock_tail_cleared = false;
+
+	/* Do we have *next node* arrived */
+	bool pending_next_node = false;
+
+	tail >>= _Q_TAIL_OFFSET;
+
+	/* Do we have other CPUs in the node queue ? */
+	if (READ_ONCE(queue->tail) != tail) {
+		*p_next_node = HQLOCK_HANDOFF_LOCAL;
+		goto out;
+	}
+
+	/*
+	 * Key observations and actions:
+	 * 1) next queue isn't observed:
+	 *    a) if prev queue is observed, try to unpublish local queue
+	 *    b) if prev queue is not observed, try to clean global tail
+	 *    Anyway, perform these operations before clearing local tail.
+	 *
+	 *    Such trick is essential to safely unlink the local queue,
+	 *    otherwise we could race with upcomming local contenders,
+	 *    which will perform 'append_node_queue' while our unlink is not properly done.
+	 *
+	 * 2) next queue is observed:
+	 *    safely perform 'try_clear_queue_tail' and unlink local node if succeeded.
+	 */
+
+	prev_node = READ_ONCE(queue->prev_node);
+	pending_next_node = READ_ONCE(lock_meta->tail_node) != local_node;
+
+	/*
+	 * Tail case:
+	 * [prev_node] -> [local_node], lock_meta->tail_node == local_node
+	 *
+	 * There're no nodes after us at the moment, try updating the "lock_meta->tail_node"
+	 */
+	if (!pending_next_node && prev_node) {
+		struct numa_queue *prev_queue =
+			get_queue(lock_id, prev_node);
+
+		/* Reset next_node, in case no one will come after */
+		WRITE_ONCE(prev_queue->next_node, 0);
+
+		/*
+		 * release to publish prev_queue->next_node = 0
+		 * and to ensure ordering with 'READ_ONCE(queue->tail) != tail'
+		 */
+		if (cmpxchg_release(&lock_meta->tail_node, local_node, prev_node) == local_node) {
+			lock_tail_updated = true;
+
+			queue->next_node = 0;
+			queue->prev_node = 0;
+			next_node = 0;
+		} else {
+			/* If some node came after the local meanwhile, reset next_node back */
+			WRITE_ONCE(prev_queue->next_node, local_node);
+
+			/* We either observing updated "queue->next" or it equals zero */
+			next_node = READ_ONCE(queue->next_node);
+		}
+	}
+
+	node_tail = READ_ONCE(lock_meta->tail_node);
+
+	/* If nobody else is waiting, try clean global tail */
+	if (node_tail == local_node && !prev_node) {
+		old_val = (((u32)local_node) | (((u32)local_node) << 16));
+		/* release to ensure ordering with 'READ_ONCE(queue->tail) != tail' */
+		lock_tail_cleared = try_cmpxchg_release(&lock_meta->nodes_tail, &old_val, 0);
+	}
+
+	/*
+	 * lock_meta->tail_node was not updated and cleared,
+	 * so we have at least single non-empty node after us
+	 */
+	if (!lock_tail_updated && !lock_tail_cleared) {
+		/*
+		 * If there's a node came before clearing node queue - wait for it to link properly.
+		 * We need this for correct upcoming *unlink*, otherwise the *unlink* might race with parallel set_next_node()
+		 */
+		if (!next_node) {
+			next_node =
+				smp_cond_load_relaxed(&queue->next_node, (VAL));
+		}
+	}
+
+	/* if we're the last one in the queue - clear the queue tail */
+	if (try_clear_queue_tail(queue, tail)) {
+		/*
+		 * "lock_tail_cleared == true"
+		 * It means: we cleared "lock_meta->tail_node" and "lock_meta->head_node".
+		 *
+		 * First new contender will do "global spin" anyway, so no handoff needed
+		 * "ret == true"
+		 */
+		if (lock_tail_cleared) {
+			ret = true;
+
+			/*
+			 * If someone has arrived in the meanwhile,
+			 * don't try to free the metadata.
+			 */
+			old_val = READ_ONCE(lock_meta->nodes_tail);
+			if (!old_val) {
+				/*
+				 * We are probably the last contender,
+				 * so, need to free lock's metadata.
+				 */
+				release_lock_meta(lock, lock_meta, qnode);
+			}
+			goto out;
+		}
+
+		/*
+		 * "lock_tail_updated == true" (implies "lock_tail_cleared == false")
+		 * It means we have at least "prev_node" and unlinked "local node"
+		 *
+		 * As we unlinked "local node", we only need to guarantee correct
+		 * remote handoff, thus we have:
+		 * "ret == false"
+		 * "next_node == HQLOCK_HANDOFF_REMOTE_HEAD"
+		 */
+		if (lock_tail_updated) {
+			*p_next_node = HQLOCK_HANDOFF_REMOTE_HEAD;
+			goto out;
+		}
+
+		/*
+		 * "!lock_tail_cleared && !lock_tail_updated"
+		 * It means we have at least single node after us.
+		 *
+		 * remote handoff and corect "local node" unlink are needed.
+		 *
+		 * "next_node" visibility guarantees that we observe
+		 * correctly additon of "next_node", so the following unlink
+		 * is safe and correct.
+		 *
+		 * "next_node > 0"
+		 * "ret == false"
+		 */
+		unlink_node_queue(lock_id, prev_node, next_node);
+
+		/*
+		 * If at the head - update one.
+		 *
+		 * Another place, where "lock_meta->head_node" is updated is "append_node_queue"
+		 * But we're safe, as that happens only with the first node on empty "node list".
+		 */
+		if (!prev_node)
+			WRITE_ONCE(lock_meta->head_node, next_node);
+
+		*p_next_node = next_node;
+	} else {
+		/*
+		 * local queue has other contenders.
+		 *
+		 * 1) "lock_tail_updated == true":
+		 * It means we have at least "prev_node" and unlinked "local node"
+		 * Also, some new nodes can arrive and link after "prev_node".
+		 * We should just re-add "local node": (prev_node) => ... => (local_node)
+		 * and perform local handoff, as other CPUs from the local node do "mcs spin"
+		 *
+		 * 2) "lock_tail_cleared == true"
+		 * It means we cleared "lock_meta->tail_node" and "lock->head_node".
+		 * We need to re-add "local node" and move "local_queue->head" to the next "mcs-node",
+		 * which is in the progress of linking after the current "mcs-node"
+		 * (that's why we couldn't clear the "local_queue->tail").
+		 *
+		 * Meanwhile other nodes can arrive: (new_node) => (...)
+		 * That "new_node" will spin in "global spin" mode.
+		 * In this case no handoff needed.
+		 *
+		 * 3) "!lock_tail_cleared && !lock_tail_updated"
+		 * It means we had at least one node after us before 'try_clear_queue_tail'
+		 * and only need to perform local handoff
+		 */
+
+		/* Cases 1) and 2) */
+		if (lock_tail_updated || lock_tail_cleared) {
+			u16 prev_node_id;
+
+			init_queue_link(queue);
+			prev_node_id =
+				append_node_queue(lock_id, local_node);
+
+			if (prev_node_id && lock_tail_cleared) {
+				/* Case 2) */
+				ret = true;
+				WRITE_ONCE(queue->head,
+					   (void *) smp_cond_load_relaxed(&node->next, (VAL)));
+				goto out;
+			}
+		}
+
+		/* Cases 1) and 3) */
+		*p_next_node = HQLOCK_HANDOFF_LOCAL;
+		ret = false;
+	}
+out:
+	/*
+	 * Either handoff for current node,
+	 * or remote handoff if the quota is expired
+	 */
+	return ret;
+}
+
+static inline void hqlock_handoff(struct qspinlock *lock,
+					 struct mcs_spinlock *node,
+					 struct mcs_spinlock *next, u32 tail,
+					 int handoff_info);
+
+
+/*
+ * Chech if contention has risen and if we need to set NUMA-aware mode
+ */
+static __always_inline bool determine_contention_qspinlock_mode(struct mcs_spinlock *node)
+{
+	struct numa_qnode *qnode = (void *)node;
+
+	if (qnode->general_handoffs > READ_ONCE(hqlock_general_handoffs_turn_numa))
+		return true;
+	return false;
+}
+
+static __always_inline bool low_contention_try_clear_tail(struct qspinlock *lock,
+					     u32 val,
+					     struct mcs_spinlock *node)
+{
+	u32 update_val = _Q_LOCKED_VAL | _Q_LOCKTYPE_HQ;
+
+	bool high_contention = determine_contention_qspinlock_mode(node);
+
+	/*
+	 * If we have high contention, we set _Q_LOCK_INVALID_TAIL
+	 * to notify upcomming contenders, which have seen QSPINLOCK mode,
+	 * that performing generic 'xchg_tail' is wrong.
+	 *
+	 * We cannot also set HQLOCK mode here,
+	 * because first contender in updated mode
+	 * should check if lock's metadata is free
+	 */
+	if (!high_contention)
+		update_val |= _Q_LOCK_MODE_QSPINLOCK_VAL;
+	else
+		update_val |= _Q_LOCK_INVALID_TAIL;
+
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, update_val);
+}
+
+static __always_inline void low_contention_mcs_lock_handoff(struct mcs_spinlock *node,
+					       struct mcs_spinlock *next, struct mcs_spinlock *prev)
+{
+	struct numa_qnode *qnode = (void *)node;
+	struct numa_qnode *qnext = (void *)next;
+
+	static u16 max_u16 = (u16)(-1);
+
+	u16 general_handoffs = qnode->general_handoffs;
+
+	if (next != prev && likely(general_handoffs + 1 != max_u16))
+		general_handoffs++;
+
+	qnext->general_handoffs = general_handoffs;
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+static inline void hqlock_clear_tail_handoff(struct qspinlock *lock, u32 val,
+				    u32 tail,
+				    struct mcs_spinlock *node,
+				    struct mcs_spinlock *next,
+					struct mcs_spinlock *prev,
+					bool is_numa_lock)
+{
+	int handoff_info;
+	struct numa_qnode *qnode = (void *)node;
+
+	/*
+	 * qnode->wrong_fallback_tail means we have queued globally
+	 * in 'try_update_tail_qspinlock_mode' after another contender,
+	 * but lock's mode was not QSPINLOCK in that moment.
+	 *
+	 * First confused contender has restored _Q_LOCK_INVALID_TAIL in global tail
+	 * and set us in his local queue.
+	 */
+	if (is_numa_lock || qnode->wrong_fallback_tail) {
+		/*
+		 * Because of splitting generic tail and NUMA tail we must set locked before clearing tail,
+		 * otherwise double lock is possible
+		 */
+		set_locked(lock);
+
+		if (hqlock_try_clear_tail(lock, val, tail, node, &handoff_info))
+			return;
+
+		hqlock_handoff(lock, node, next, tail, handoff_info);
+	} else {
+		if ((val & _Q_TAIL_MASK) == tail) {
+			if (low_contention_try_clear_tail(lock, val, node))
+				return;
+		}
+
+		set_locked(lock);
+
+		if (!next)
+			next = smp_cond_load_relaxed(&node->next, (VAL));
+
+		low_contention_mcs_lock_handoff(node, next, prev);
+	}
+}
+
+static inline void hqlock_init_node(struct mcs_spinlock *node)
+{
+	struct numa_qnode *qnode = (void *)node;
+
+	qnode->general_handoffs = 0;
+	qnode->numa_node = numa_node_id() + 1;
+	qnode->lock_id = 0;
+	qnode->wrong_fallback_tail = 0;
+}
+
+static inline void reset_handoff_counter(struct numa_qnode *qnode)
+{
+	qnode->general_handoffs = 0;
+}
+
+static inline void handoff_local(struct mcs_spinlock *node,
+					       struct mcs_spinlock *next,
+					       u32 tail)
+{
+	static u16 max_u16 = (u16)(-1);
+
+	struct numa_qnode *qnode = (struct numa_qnode *)node;
+	struct numa_qnode *qnext = (struct numa_qnode *)next;
+
+	u16 general_handoffs = qnode->general_handoffs;
+
+	if (likely(general_handoffs + 1 != max_u16))
+		general_handoffs++;
+
+	qnext->general_handoffs = general_handoffs;
+
+	u16 wrong_fallback_tail = qnode->wrong_fallback_tail;
+
+	if (wrong_fallback_tail != 0 && wrong_fallback_tail != (tail >> _Q_TAIL_OFFSET)) {
+		qnext->numa_node = qnode->numa_node;
+		qnext->wrong_fallback_tail = wrong_fallback_tail;
+		qnext->lock_id = qnode->lock_id;
+	}
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+static inline void handoff_remote(struct qspinlock *lock,
+						struct numa_qnode *qnode,
+						u32 tail, int handoff_info)
+{
+	struct numa_queue *next_queue = NULL;
+	struct mcs_spinlock *mcs_head = NULL;
+	struct numa_qnode *qhead = NULL;
+	u16 lock_id = qnode->lock_id;
+
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	u16 next_node_id;
+	u16 node_head, node_tail;
+
+	node_tail = READ_ONCE(lock_meta->tail_node);
+	node_head = READ_ONCE(lock_meta->head_node);
+
+	/*
+	 * 'handoffs_not_head > 0' means at the head of NUMA-level queue we have a node
+	 * which is heavily loaded and has performed a remote handoff upon reaching the threshold.
+	 *
+	 * Perform handoff to the head instead of next node in the NUMA-level queue,
+	 * if handoffs_not_head >= nr_online_nodes
+	 * (It means other contended nodes have been taking the lock at least once after the head one)
+	 */
+	u16 handoffs_not_head = READ_ONCE(queue->handoffs_not_head);
+
+	if (handoff_info > 0 && (handoffs_not_head < nr_online_nodes)) {
+		next_node_id = handoff_info;
+		if (node_head != qnode->numa_node)
+			handoffs_not_head++;
+	} else {
+		if (!node_head) {
+			/* If we're here - we have defintely other node-contenders, let's wait */
+			next_node_id = smp_cond_load_relaxed(&lock_meta->head_node, (VAL));
+		} else {
+			next_node_id = node_head;
+		}
+
+		handoffs_not_head = 0;
+	}
+
+	next_queue = get_queue(lock_id, next_node_id);
+	WRITE_ONCE(next_queue->handoffs_not_head, handoffs_not_head);
+
+	qhead = READ_ONCE(next_queue->head);
+
+	mcs_head = (void *) qhead;
+
+	/* arch_mcs_spin_unlock_contended implies smp-barrier */
+	arch_mcs_spin_unlock_contended(&mcs_head->locked);
+}
+
+static inline bool has_other_nodes(struct qspinlock *lock,
+				   struct numa_qnode *qnode)
+{
+	struct lock_metadata *lock_meta = get_meta(qnode->lock_id);
+
+	return lock_meta->tail_node != qnode->numa_node;
+}
+
+static inline bool is_node_threshold_reached(struct numa_qnode *qnode)
+{
+	return qnode->general_handoffs > hqlock_fairness_threshold;
+}
+
+static inline void hqlock_handoff(struct qspinlock *lock,
+					 struct mcs_spinlock *node,
+					 struct mcs_spinlock *next, u32 tail,
+					 int handoff_info)
+{
+	struct numa_qnode *qnode = (void *)node;
+	u16 lock_id = qnode->lock_id;
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	if (handoff_info == HQLOCK_HANDOFF_LOCAL) {
+		if (!next)
+			next = smp_cond_load_relaxed(&node->next, (VAL));
+		WRITE_ONCE(queue->head, (void *) next);
+
+		bool threshold_expired = is_node_threshold_reached(qnode);
+
+		if (!threshold_expired || qnode->wrong_fallback_tail) {
+			handoff_local(node, next, tail);
+			return;
+		}
+
+		u16 queue_next = READ_ONCE(queue->next_node);
+		bool has_others = has_other_nodes(lock, qnode);
+
+		/*
+		 * This check is racy, but it's ok,
+		 * because we fallback to local node in the worst case
+		 * and do not call reset_handoff_counter.
+		 * Next local contender will perform remote handoff
+		 * after next queue is properly linked
+		 */
+		if (has_others) {
+			handoff_info =
+				queue_next > 0 ? queue_next : HQLOCK_HANDOFF_LOCAL;
+		} else {
+			handoff_info = HQLOCK_HANDOFF_REMOTE_HEAD;
+		}
+
+		if (handoff_info == HQLOCK_HANDOFF_LOCAL ||
+			(handoff_info == HQLOCK_HANDOFF_REMOTE_HEAD &&
+				READ_ONCE(lock_meta->head_node) == qnode->numa_node)) {
+			/*
+			 * No other nodes have come yet, so we can clean fairness counter
+			 */
+			if (handoff_info == HQLOCK_HANDOFF_REMOTE_HEAD)
+				reset_handoff_counter(qnode);
+			handoff_local(node, next, tail);
+			return;
+		}
+	}
+
+	handoff_remote(lock, qnode, tail, handoff_info);
+	reset_handoff_counter(qnode);
+}
diff --git a/kernel/locking/hqlock_meta.h b/kernel/locking/hqlock_meta.h
new file mode 100644
index 0000000000..5b54801326
--- /dev/null
+++ b/kernel/locking/hqlock_meta.h
@@ -0,0 +1,467 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+/* Lock metadata pool */
+static struct lock_metadata *meta_pool;
+
+static inline struct lock_metadata *get_meta(u16 lock_id)
+{
+	return &meta_pool[lock_id];
+}
+
+static inline hqlock_mode_t set_lock_mode(struct qspinlock *lock, int __val, u16 lock_id)
+{
+	u32 val = (u32)__val;
+	u32 new_val = 0;
+	u32 lock_mode = encode_lock_mode(lock_id);
+
+	while (!(val & _Q_LOCK_MODE_MASK)) {
+		/*
+		 * We need wait until pending is gone.
+		 * Otherwise, clearing pending can erase a NUMA mode we will set here
+		 */
+		if (val & _Q_PENDING_VAL) {
+			val = atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_PENDING_VAL));
+
+			if (val & _Q_LOCK_MODE_MASK)
+				return LOCK_NO_MODE;
+		}
+
+		/*
+		 * If we are enabling NUMA-awareness, we should keep previous value in lock->tail
+		 * in case of having contenders seen LOCK_MODE_QSPINLOCK and set their tails via xchg_tail
+		 * (They will restore it to _Q_LOCK_INVALID_TAIL later).
+		 * If we are setting LOCK_MODE_QSPINLOCK, remove _Q_LOCK_INVALID_TAIL
+		 */
+		if (lock_id != LOCK_ID_NONE)
+			new_val = val | lock_mode;
+		else
+			new_val = (val & ~_Q_LOCK_INVALID_TAIL) | lock_mode;
+
+		/*
+		 * If we're setting LOCK_MODE_HQLOCK, make sure all "seq_counter"
+		 * updates (per-queue, lock_meta) are observed before lock mode update.
+		 * Paired with smp_rmb() in setup_lock_mode().
+		 */
+		if (lock_id != LOCK_ID_NONE)
+			smp_wmb();
+
+		bool updated = atomic_try_cmpxchg_relaxed(&lock->val, &val, new_val);
+
+		if (updated) {
+			return (lock_id == LOCK_ID_NONE) ?
+				LOCK_MODE_QSPINLOCK : LOCK_MODE_HQLOCK;
+		}
+	}
+
+	return LOCK_NO_MODE;
+}
+
+static inline hqlock_mode_t set_mode_hqlock(struct qspinlock *lock, int val, u16 lock_id)
+{
+	return set_lock_mode(lock, val, lock_id);
+}
+
+static inline hqlock_mode_t set_mode_qspinlock(struct qspinlock *lock, int val)
+{
+	return set_lock_mode(lock, val, LOCK_ID_NONE);
+}
+
+/* Dynamic lock-mode conditions */
+static inline bool is_mode_hqlock(int val)
+{
+	return decode_lock_mode(val) == LOCK_MODE_HQLOCK;
+}
+
+static inline bool is_mode_qspinlock(int val)
+{
+	return decode_lock_mode(val) == LOCK_MODE_QSPINLOCK;
+}
+
+enum meta_status {
+	META_CONFLICT = 0,
+	META_GRABBED,
+	META_SHARED,
+};
+
+static inline enum meta_status grab_lock_meta(struct qspinlock *lock, u32 lock_id, u32 *seq)
+{
+	int nid, seq_counter;
+	struct numa_queue *queue;
+	struct qspinlock *old = READ_ONCE(meta_pool[lock_id].lock_ptr);
+
+	if (old && old != lock)
+		return META_CONFLICT;
+
+	if (old && old == lock)
+		return META_SHARED;
+
+	old = cmpxchg_acquire(&meta_pool[lock_id].lock_ptr, NULL, lock);
+	if (!old)
+		goto init_meta;
+
+	/* Hash-conflict */
+	if (old != lock)
+		return META_CONFLICT;
+
+	return META_SHARED;
+init_meta:
+	/*
+	 * Update allocations counter and set it to per-NUMA queues
+	 * to prevent upcomming contenders from parking on deallocated queues
+	 */
+	seq_counter = atomic_inc_return_relaxed(&meta_pool[lock_id].seq_counter);
+
+	/* Very unlikely we can overflow */
+	if (unlikely(seq_counter == 0))
+		seq_counter = atomic_inc_return_relaxed(&meta_pool[lock_id].seq_counter);
+
+	for_each_online_node(nid) {
+		queue = &queue_table[nid][lock_id];
+		WRITE_ONCE(queue->seq_counter, (u32)seq_counter);
+	}
+
+	*seq = seq_counter;
+	return META_GRABBED;
+}
+
+/*
+ * Try to setup current lock mode:
+ *
+ * LOCK_MODE_HQLOCK or fallback to default LOCK_MODE_QSPINLOCK
+ * if there's hash conflict with another lock in the system.
+ *
+ * In general the setup consists of grabbing lock-related metadata and
+ * publishing the mode in the global lock variable.
+ *
+ * For quick meta-lookup the pointer hashing is used.
+ *
+ * To identify "occupied/free" metadata record, we use "meta->lock_ptr"
+ * which is set to corresponding spinlock lock pointer or "NULL".
+ *
+ * The action sequence from initial state is the following:
+ *
+ * "Find lock-meta by hash" => "Occupy lock-meta" => publish "LOCK_MODE_HQLOCK" in
+ * global lock variable.
+ *
+ */
+static inline
+hqlock_mode_t setup_lock_mode(struct qspinlock *lock, u16 lock_id, u32 *meta_seq_counter)
+{
+	hqlock_mode_t mode;
+
+	do {
+		enum meta_status status;
+		int val = atomic_read(&lock->val);
+
+		if (is_mode_hqlock(val)) {
+			struct lock_metadata *lock_meta = get_meta(lock_id);
+			/*
+			 * The lock is currently in LOCK_MODE_HQLOCK, we need to make sure the
+			 * associated metadata isn't used by another lock.
+			 *
+			 * In the meanwhile several situations can occur:
+			 *
+			 * [Case 1] Another lock using the meta (hash-conflict)
+			 *
+			 * If "release + reallocate" of the meta happenned in the meanwhile,
+			 * we're guaranteed to observe lock-mode change in the "lock->val",
+			 * due to the following event ordering:
+			 *
+			 * [release_lock_meta]
+			 *	Clear lock mode in "lock->val", so we wouldn't
+			 *	observe LOCK_MODE_HQLOCK mode.
+			 *	=>
+			 *        [setup_lock_mode]
+			 *	    Update lock->seq_counter
+			 *
+			 * [Case 2] For exact same lock, some contender did "release + reallocate" of the meta
+			 *
+			 * Either We'll get newly set "seq_counter", or in the worst case, we'll get
+			 * outdated "seq_counter" fail in the CAS(queue) in the caller function.
+			 *
+			 * [Case 3] Meta is free, nobody using it
+			 * [Case 4] The lock mode is changed to LOCK_MODE_QSPINLOCK.
+			 */
+			int seq_counter = atomic_read(&lock_meta->seq_counter);
+
+			/*
+			 * "seq_counter" and "lock->val" should be read in program order.
+			 * Otherwise we might observe "seq_counter" updated on-behalf another lock.
+			 * Paired with smp_wmb() in set_lock_mode().
+			 */
+			smp_rmb();
+			val = atomic_read(&lock->val);
+
+			if (is_mode_hqlock(val)) {
+				*meta_seq_counter = (u32)seq_counter;
+				return LOCK_MODE_HQLOCK;
+			}
+			/*
+			 * [else] Here it can be 2 options:
+			 *
+			 * 1. Lock-meta is free, and nobody using it.
+			 *    In this case, we need to try occupying the meta and
+			 *    publish lock-mode LOCK_MODE_HQLOCK again.
+			 *
+			 * 2. Lock mode transitioned to LOCK_MODE_QSPINLOCK mode.
+			 */
+			continue;
+		} else if (is_mode_qspinlock(val)) {
+			return LOCK_MODE_QSPINLOCK;
+		}
+
+		/*
+		 * Trying to get temporary metadata "weak" ownership,
+		 * Three situations might happen:
+		 *
+		 * 1. Metadata isn't used by anyone
+		 *    Just take the ownership.
+		 *
+		 * 2. Metadata is already grabbed by one of the lock contenders.
+		 *
+		 * 3. Hash conflict: metadata is owned by another lock
+		 *    Give up, fallback to LOCK_MODE_QSPINLOCK.
+		 */
+		status = grab_lock_meta(lock, lock_id, meta_seq_counter);
+		if (status == META_SHARED) {
+			/*
+			 * Someone started publishing lock_id for us:
+			 * 1. We can catch the "LOCK_MODE_HQLOCK" mode quickly
+			 * 2. We can loop several times before we'll see "LOCK_MODE_HQLOCK" mode set.
+			 * (lightweight check)
+			 * 3. Another contender might be able to relase lock meta meanwhile.
+			 * Either we catch it in above "seq_counter" check, or we'll grab
+			 * lock meta first and try publishing lock_id.
+			 */
+			continue;
+		}
+
+		/* Setup the lock-mode */
+		if (status == META_GRABBED)
+			mode = set_mode_hqlock(lock, val, lock_id);
+		else if (status == META_CONFLICT)
+			mode = set_mode_qspinlock(lock, val);
+		else
+			BUG_ON(1);
+		/*
+		 * If we grabbed the meta but were unable to publish LOCK_MODE_HQLOCK
+		 * release it, just by resetting the pointer.
+		 */
+		if (status == META_GRABBED && mode != LOCK_MODE_HQLOCK) {
+			smp_store_release(&meta_pool[lock_id].lock_ptr, NULL);
+		}
+	} while (mode == LOCK_NO_MODE);
+
+	return mode;
+}
+
+static inline void release_lock_meta(struct qspinlock *lock,
+					struct lock_metadata *meta,
+				     struct numa_qnode *qnode)
+{
+	int nid;
+	struct numa_queue *queue;
+	bool cleared = false;
+	u32 upd_val = _Q_LOCKTYPE_HQ | _Q_LOCKED_VAL;
+	u16 lock_id = qnode->lock_id;
+	int seq_counter = atomic_read(&meta->seq_counter);
+
+	/*
+	 * Firstly, go across per-NUMA queues and set seq counter to 0,
+	 * it will prevent possible contenders, which haven't even queued locally,
+	 * from using already deoccupied metadata.
+	 *
+	 * We need to perform counter reset with CAS,
+	 * because local contenders (we didn't see them while try_clear_lock_tail and try_clear_queue_tail)
+	 * may have appeared while we were coming that point.
+	 *
+	 * If any CAS is not successful, it means someone has already queued locally,
+	 * in that case we should restore usability of all local queues
+	 * and return seq counter to every per-NUMA queue.
+	 *
+	 * If all CASes are successful, nobody will queue on this metadata's queues,
+	 * and we can free it and allow other locks to use it.
+	 */
+
+	/*
+	 * Before metadata release read every queue tail,
+	 * if we have at least one contender, don't do CASes and leave
+	 * (Reads are much faster and also prefetch local queue's cachelines)
+	 */
+	for_each_online_node(nid) {
+		struct numa_queue *queue = get_queue(lock_id, nid + 1);
+
+		if (READ_ONCE(queue->tail) != 0)
+			return;
+	}
+
+	for_each_online_node(nid) {
+		struct numa_queue *queue = get_queue(lock_id, nid + 1);
+
+		if (cmpxchg_relaxed(&queue->seq_counter_tail, encode_tc(0, seq_counter), 0)
+			!= encode_tc(0, seq_counter))
+			/* Some contender arrived - rollback */
+			goto do_rollback;
+	}
+
+	/*
+	 * We need wait until pending is gone.
+	 * Otherwise, clearing pending can erase a mode we will set here
+	 */
+	while (!cleared) {
+		u32 old_lock_val = atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_PENDING_VAL));
+
+		cleared = atomic_try_cmpxchg_relaxed(&lock->val,
+				&old_lock_val, upd_val | (old_lock_val & _Q_TAIL_MASK));
+	}
+
+	/*
+	 * guarantee current seq counter is erased from every local queue
+	 * and lock mode has been updated before another lock can use metadata
+	 */
+	smp_store_release(&meta_pool[qnode->lock_id].lock_ptr, NULL);
+	return;
+
+do_rollback:
+	for_each_online_node(nid) {
+		queue = get_queue(lock_id, nid + 1);
+		WRITE_ONCE(queue->seq_counter, seq_counter);
+	}
+}
+
+/*
+ * Call it if we observe LOCK_MODE_QSPINLOCK.
+ *
+ * We can do generic xchg_tail in this case,
+ * if lock's mode has already been changed, we will get _Q_LOCK_INVALID_TAIL.
+ *
+ * If we have such a situation, we perform CAS cycle
+ * to restore _Q_LOCK_INVALID_TAIL or wait until lock's mode is LOCK_MODE_QSPINLOCK.
+ *
+ * All upcomming confused contenders will see valid tail.
+ * We will remember the last one before successful CAS and put its tail in local queue.
+ * During handoff we will notify them about mode change via qnext->wrong_fallback_tail
+ */
+static inline bool try_update_tail_qspinlock_mode(struct qspinlock *lock, u32 tail, u32 *old_tail, u32 *next_tail)
+{
+	/*
+	 * next_tail may be tail or last cpu from previous unsuccessful call
+	 * (highly unlikely, but still)
+	 */
+	u32 xchged_tail = xchg_tail(lock, *next_tail);
+
+	if (likely(xchged_tail != _Q_LOCK_INVALID_TAIL)) {
+		*old_tail = xchged_tail;
+		return true;
+	}
+
+	/*
+	 * If we got _Q_LOCK_INVALID_TAIL, it means lock was not in LOCK_MODE_QSPINLOCK.
+	 * In this case we should restore _Q_LOCK_INVALID_TAIL
+	 * and remember next contenders that got confused.
+	 * Later we will update lock's or local queue's tail to the last contender seen here.
+	 */
+	u32 val = atomic_read(&lock->val);
+
+	bool fixed = false;
+
+	while (!fixed) {
+		if (decode_lock_mode(val) == LOCK_MODE_QSPINLOCK) {
+			*old_tail = 0;
+			return true;
+		}
+
+		/*
+		 * CAS is needed here to catch possible lock mode change
+		 * from LOCK_MODE_HQLOCK to LOCK_MODE_QSPINLOCK in the meanwhile.
+		 * Thus preventing from publishing _Q_LOCK_INVALID_TAIL
+		 * when LOCK_MODE_QSPINLOCK is enabled.
+		 */
+		fixed = atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCK_INVALID_TAIL |
+				(val & (_Q_LOCKED_PENDING_MASK | _Q_LOCK_TYPE_MODE_MASK)));
+	}
+
+	if ((val & _Q_TAIL_MASK) != tail)
+		*next_tail = val & _Q_TAIL_MASK;
+
+	return false;
+}
+
+/*
+ * Call it if we observe LOCK_MODE_HQLOCK or LOCK_NO_MODE in the lock.
+ *
+ * Actions performed:
+ * - Call setup_lock_mode to set or read lock's mode,
+ *   read metadata's sequential counter for valid local queueing
+ * - CAS on union of local tail and meta_seq_counter
+ *   to guarantee metadata usage correctness.
+ *   Repeat from the beginning if fail.
+ * - If we are the first local contender,
+ *   update global tail with our NUMA node
+ */
+static inline bool try_update_tail_hqlock_mode(struct qspinlock *lock, u16 lock_id,
+				struct numa_qnode *qnode, u32 tail, u32 *next_tail, u32 *old_tail)
+{
+	u32 meta_seq_counter;
+	hqlock_mode_t mode;
+
+	struct numa_queue *queue;
+	u64 old_counter_tail;
+	bool updated_queue_tail = false;
+
+re_setup:
+	mode = setup_lock_mode(lock, lock_id, &meta_seq_counter);
+
+	if (mode == LOCK_MODE_QSPINLOCK)
+		return false;
+
+	queue = get_local_queue(qnode);
+
+	/*
+	 * While queueing locally, perform CAS cycle
+	 * on union of tail and meta_seq_counter.
+	 *
+	 * meta_seq_counter is taken from the lock metadata while allocation,
+	 * it's updated every time it's used by a next lock.
+	 * It shows that queue is used correctly
+	 * and metadata hasn't been deoccupied before we queued locally.
+	 */
+	old_counter_tail = READ_ONCE(queue->seq_counter_tail);
+
+	while (!updated_queue_tail &&
+		   decode_tc_counter(old_counter_tail) == meta_seq_counter) {
+		updated_queue_tail =
+			try_cmpxchg_relaxed(&queue->seq_counter_tail, &old_counter_tail,
+				encode_tc((*next_tail) >> _Q_TAIL_OFFSET, meta_seq_counter));
+	}
+
+	/* queue->seq_counter changed */
+	if (!updated_queue_tail)
+		goto re_setup;
+
+	/*
+	 * The condition means we tried to perform generic tail update in try_update_tail_qspinlock_mode,
+	 * but before we did it, lock type was changed.
+	 * Moreover, some contenders have come after us in LOCK_MODE_QSPINLOCK,
+	 * during handoff we must notify them that they are set in LOCK_MODE_HQLOCK in our node's local queue
+	 */
+	if (unlikely(*next_tail != tail))
+		qnode->wrong_fallback_tail = *next_tail >> _Q_TAIL_OFFSET;
+
+	*old_tail = decode_tc_tail(old_counter_tail);
+
+	if (!(*old_tail)) {
+		u16 prev_node_id;
+
+		init_queue(qnode);
+		prev_node_id = append_node_queue(lock_id, qnode->numa_node);
+		*old_tail = prev_node_id ? Q_NEW_NODE_QUEUE : 0;
+	} else {
+		*old_tail <<= _Q_TAIL_OFFSET;
+	}
+
+	return true;
+}
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 5/7] kernel: introduce general hq-spinlock support
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

Integrate HQ spinlocks into the generic queued spinlock code.
It includes:
- Kernel config
- Required macros
- PV-locks support
- HQ-lock type/mode masking support in generic code
- Integration hq-spinlock into queued spinlock slowpath

Now hq-spinlock can be enabled with:
- `spin_lock_init_hq()` for dynamic locks
- `DEFINE_SPINLOCK_HQ()` for static locks

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
---
 arch/arm64/include/asm/qspinlock.h       | 37 ++++++++++++
 arch/x86/include/asm/hq-spinlock.h       | 34 +++++++++++
 arch/x86/include/asm/paravirt-spinlock.h |  3 +-
 arch/x86/include/asm/qspinlock.h         |  6 +-
 include/asm-generic/qspinlock.h          | 23 ++++---
 include/linux/spinlock.h                 | 26 ++++++++
 include/linux/spinlock_types.h           | 26 ++++++++
 include/linux/spinlock_types_raw.h       | 20 ++++++
 kernel/Kconfig.locks                     | 29 +++++++++
 kernel/locking/hqlock_core.h             | 77 ++++++++++++++++++++++++
 kernel/locking/qspinlock.c               | 65 +++++++++++++++++---
 kernel/locking/qspinlock.h               |  4 +-
 kernel/locking/spinlock_debug.c          | 20 ++++++
 mm/mempolicy.c                           |  4 ++
 14 files changed, 352 insertions(+), 22 deletions(-)
 create mode 100644 arch/arm64/include/asm/qspinlock.h
 create mode 100644 arch/x86/include/asm/hq-spinlock.h

diff --git a/arch/arm64/include/asm/qspinlock.h b/arch/arm64/include/asm/qspinlock.h
new file mode 100644
index 0000000000..5b8b1ca0f4
--- /dev/null
+++ b/arch/arm64/include/asm/qspinlock.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_QSPINLOCK_H
+#define _ASM_ARM64_QSPINLOCK_H
+
+#ifdef CONFIG_HQSPINLOCKS
+
+extern void hq_configure_spin_lock_slowpath(void);
+
+extern void (*hq_queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+#define	queued_spin_unlock queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void native_queued_spin_unlock(struct qspinlock *lock)
+{
+	smp_store_release(&lock->locked, 0);
+}
+
+static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+	hq_queued_spin_lock_slowpath(lock, val);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	native_queued_spin_unlock(lock);
+}
+#endif
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_ARM64_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/hq-spinlock.h b/arch/x86/include/asm/hq-spinlock.h
new file mode 100644
index 0000000000..f4b088164b
--- /dev/null
+++ b/arch/x86/include/asm/hq-spinlock.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_X86_HQ_SPINLOCK_H
+#define _ASM_X86_HQ_SPINLOCK_H
+
+extern void hq_configure_spin_lock_slowpath(void);
+
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+extern void (*hq_queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+#define	queued_spin_unlock queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void native_queued_spin_unlock(struct qspinlock *lock)
+{
+	smp_store_release(&lock->locked, 0);
+}
+
+static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+	hq_queued_spin_lock_slowpath(lock, val);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	native_queued_spin_unlock(lock);
+}
+#endif // !CONFIG_PARAVIRT_SPINLOCKS
+
+#endif // _ASM_X86_HQ_SPINLOCK_H
diff --git a/arch/x86/include/asm/paravirt-spinlock.h b/arch/x86/include/asm/paravirt-spinlock.h
index 7beffcb08e..9c37d6b47b 100644
--- a/arch/x86/include/asm/paravirt-spinlock.h
+++ b/arch/x86/include/asm/paravirt-spinlock.h
@@ -134,7 +134,8 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
  __retry:
 	val = atomic_read(&lock->val);
 
-	if (val || !atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL)) {
+	if ((val & ~_Q_SERVICE_MASK) ||
+	     !atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL | (val & _Q_SERVICE_MASK))) {
 		cpu_relax();
 		goto __retry;
 	}
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 25a1919542..13950221eb 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -11,6 +11,10 @@
 #include <asm/paravirt-spinlock.h>
 #endif
 
+#ifdef CONFIG_HQSPINLOCKS
+#include <asm/hq-spinlock.h>
+#endif
+
 #define _Q_PENDING_LOOPS	(1 << 9)
 
 #define queued_fetch_set_pending_acquire queued_fetch_set_pending_acquire
@@ -25,7 +29,7 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	 */
 	val = GEN_BINARY_RMWcc(LOCK_PREFIX "btsl", lock->val.counter, c,
 			       "I", _Q_PENDING_OFFSET) * _Q_PENDING_VAL;
-	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;
+	val |= atomic_read(&lock->val) & ~_Q_PENDING_VAL;
 
 	return val;
 }
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index bf47cca2c3..af3ab0286e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -54,7 +54,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
 	 * isn't immediately observable.
 	 */
-	return atomic_read(&lock->val);
+	return atomic_read(&lock->val) & ~_Q_SERVICE_MASK;
 }
 #endif
 
@@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
  */
 static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	return !lock.val.counter;
+	return !(lock.val.counter & ~_Q_SERVICE_MASK);
 }
 
 /**
@@ -80,8 +80,10 @@ static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
  */
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+	return atomic_read(&lock->val) & ~(_Q_LOCKED_MASK | _Q_SERVICE_MASK);
 }
+
+#ifndef queued_spin_trylock
 /**
  * queued_spin_trylock - try to acquire the queued spinlock
  * @lock : Pointer to queued spinlock structure
@@ -91,11 +93,12 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
 	int val = atomic_read(&lock->val);
 
-	if (unlikely(val))
+	if (unlikely(val & ~_Q_SERVICE_MASK))
 		return 0;
 
-	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
+	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, val | _Q_LOCKED_VAL));
 }
+#endif
 
 extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 
@@ -106,14 +109,16 @@ extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
  */
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	int val = 0;
+	int val = atomic_read(&lock->val);
 
-	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
-		return;
+	if (likely(!(val & ~_Q_SERVICE_MASK))) {
+		if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, val | _Q_LOCKED_VAL)))
+			return;
+	}
 
 	queued_spin_lock_slowpath(lock, val);
 }
-#endif
+#endif // queued_spin_lock
 
 #ifndef queued_spin_unlock
 /**
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index e1e2f144af..e953018934 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -100,6 +100,8 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
   extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name,
 				   struct lock_class_key *key, short inner);
+  extern void __raw_spin_lock_init_hq(raw_spinlock_t *lock, const char *name,
+				   struct lock_class_key *key, short inner);
 
 # define raw_spin_lock_init(lock)					\
 do {									\
@@ -108,9 +110,19 @@ do {									\
 	__raw_spin_lock_init((lock), #lock, &__key, LD_WAIT_SPIN);	\
 } while (0)
 
+# define raw_spin_lock_init_hq(lock)					\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__raw_spin_lock_init_hq((lock), #lock, &__key, LD_WAIT_SPIN);	\
+} while (0)
+
 #else
 # define raw_spin_lock_init(lock)				\
 	do { *(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock); } while (0)
+
+# define raw_spin_lock_init_hq(lock)				\
+	do { *(lock) = __RAW_SPIN_LOCK_UNLOCKED_HQ(lock); } while (0)
 #endif
 
 #define raw_spin_is_locked(lock)	arch_spin_is_locked(&(lock)->raw_lock)
@@ -325,6 +337,14 @@ do {								\
 			     #lock, &__key, LD_WAIT_CONFIG);	\
 } while (0)
 
+# define spin_lock_init_hq(lock)					\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	__raw_spin_lock_init_hq(spinlock_check(lock),		\
+			     #lock, &__key, LD_WAIT_CONFIG);	\
+} while (0)
+
 #else
 
 # define spin_lock_init(_lock)			\
@@ -333,6 +353,12 @@ do {						\
 	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);	\
 } while (0)
 
+# define spin_lock_init_hq(_lock)			\
+do {						\
+	spinlock_check(_lock);			\
+	*(_lock) = __SPIN_LOCK_UNLOCKED_HQ(_lock);	\
+} while (0)
+
 #endif
 
 static __always_inline void spin_lock(spinlock_t *lock)
diff --git a/include/linux/spinlock_types.h b/include/linux/spinlock_types.h
index b65bb6e445..ad68f6ad8d 100644
--- a/include/linux/spinlock_types.h
+++ b/include/linux/spinlock_types.h
@@ -43,6 +43,29 @@ typedef struct spinlock spinlock_t;
 
 #define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
 
+#ifdef __ARCH_SPIN_LOCK_UNLOCKED_HQ
+#define ___SPIN_LOCK_INITIALIZER_HQ(lockname)	\
+	{					\
+	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED_HQ,	\
+	SPIN_DEBUG_INIT(lockname)		\
+	SPIN_DEP_MAP_INIT(lockname) }
+
+#else
+#define ___SPIN_LOCK_INITIALIZER_HQ(lockname)	\
+	{					\
+	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	SPIN_DEBUG_INIT(lockname)		\
+	SPIN_DEP_MAP_INIT(lockname) }
+#endif
+
+#define __SPIN_LOCK_INITIALIZER_HQ(lockname) \
+	{ { .rlock = ___SPIN_LOCK_INITIALIZER_HQ(lockname) } }
+
+#define __SPIN_LOCK_UNLOCKED_HQ(lockname) \
+	((spinlock_t) __SPIN_LOCK_INITIALIZER_HQ(lockname))
+
+#define DEFINE_SPINLOCK_HQ(x) spinlock_t x = __SPIN_LOCK_UNLOCKED_HQ(x)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /* PREEMPT_RT kernels map spinlock to rt_mutex */
@@ -71,6 +94,9 @@ typedef struct spinlock spinlock_t;
 #define DEFINE_SPINLOCK(name)					\
 	spinlock_t name = __SPIN_LOCK_UNLOCKED(name)
 
+#define DEFINE_SPINLOCK_HQ(name)					\
+	spinlock_t name = __SPIN_LOCK_UNLOCKED(name)
+
 #endif /* CONFIG_PREEMPT_RT */
 
 #include <linux/rwlock_types.h>
diff --git a/include/linux/spinlock_types_raw.h b/include/linux/spinlock_types_raw.h
index e5644ab216..0e4126a23b 100644
--- a/include/linux/spinlock_types_raw.h
+++ b/include/linux/spinlock_types_raw.h
@@ -71,4 +71,24 @@ typedef struct raw_spinlock raw_spinlock_t;
 
 #define DEFINE_RAW_SPINLOCK(x)  raw_spinlock_t x = __RAW_SPIN_LOCK_UNLOCKED(x)
 
+#ifdef __ARCH_SPIN_LOCK_UNLOCKED_HQ
+#define __RAW_SPIN_LOCK_INITIALIZER_HQ(lockname)	\
+{						\
+	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED_HQ,	\
+	SPIN_DEBUG_INIT(lockname)		\
+	RAW_SPIN_DEP_MAP_INIT(lockname) }
+
+#else
+#define __RAW_SPIN_LOCK_INITIALIZER_HQ(lockname)	\
+{						\
+	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	SPIN_DEBUG_INIT(lockname)		\
+	RAW_SPIN_DEP_MAP_INIT(lockname) }
+#endif
+
+#define __RAW_SPIN_LOCK_UNLOCKED_HQ(lockname)	\
+	((raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER_HQ(lockname))
+
+#define DEFINE_RAW_SPINLOCK_HQ(x)  raw_spinlock_t x = __RAW_SPIN_LOCK_UNLOCKED_HQ(x)
+
 #endif /* __LINUX_SPINLOCK_TYPES_RAW_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273e..c96f3a4551 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -243,6 +243,35 @@ config QUEUED_SPINLOCKS
 	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 	depends on SMP
 
+config HQSPINLOCKS
+	bool "(NUMA-aware) Hierarchical Queued spinlock"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on NR_CPUS < 16384
+	depends on !CPU_BIG_ENDIAN
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of kernel's spinlocks.
+
+	  We preallocate 'LOCK_ID_MAX' lock_metadata structures with corresponing per-NUMA queues,
+	  and the first queueing contender finds metadata corresponding to its lock by lock's hash and occupies it.
+	  If the metadata is already occupied, perform fallback to default qspinlock approach.
+	  Upcomming contenders then exchange the tail of per-NUMA queue instead of global tail,
+	  and until threshold is reached, we pass the lock to the condenters from the local queue.
+	  At the global tail we keep NUMA nodes to maintain FIFO on per-node level.
+
+	  That approach helps to reduce cross-numa accesses and thus improve lock's performance
+	  in high-load scenarios on multi-NUMA node machines.
+
+	  Say N, if you want absolute first come first serve fairness.
+
+config HQSPINLOCKS_DEBUG
+	bool "Enable debug output for numa-aware spinlocks"
+	depends on HQSPINLOCKS
+	default n
+	help
+	  This option enables statistics for dynamic metadata allocation for HQspinlock
+
 config BPF_ARCH_SPINLOCK
 	bool
 
diff --git a/kernel/locking/hqlock_core.h b/kernel/locking/hqlock_core.h
index b7681915b4..74f92ac3d8 100644
--- a/kernel/locking/hqlock_core.h
+++ b/kernel/locking/hqlock_core.h
@@ -771,3 +771,80 @@ static inline void hqlock_handoff(struct qspinlock *lock,
 	handoff_remote(lock, qnode, tail, handoff_info);
 	reset_handoff_counter(qnode);
 }
+
+static void __init hqlock_alloc_global_queues(void)
+{
+	int nid;
+	unsigned long meta_pool_size, queues_size;
+
+	meta_pool_size = ALIGN(sizeof(struct lock_metadata), L1_CACHE_BYTES) * LOCK_ID_MAX;
+
+	pr_info("Init HQspinlock lock_metadata info: size = 0x%lx B\n",
+		meta_pool_size);
+
+	meta_pool = kvzalloc(meta_pool_size, GFP_KERNEL);
+
+	if (!meta_pool)
+		panic("HQspinlock lock_metadata metadata info: allocation failure.\n");
+
+	for (int i = 0; i < LOCK_ID_MAX; i++)
+		atomic_set(&meta_pool[i].seq_counter, 0);
+
+	queues_size = LOCK_ID_MAX * ALIGN(sizeof(struct numa_queue), L1_CACHE_BYTES);
+
+	pr_info("Init HQspinlock per-NUMA metadata (per-node size = 0x%lx B)\n",
+		queues_size);
+
+	for_each_node(nid) {
+		queue_table[nid] = kvzalloc_node(queues_size, GFP_KERNEL, nid);
+
+		if (!queue_table[nid])
+			panic("HQspinlock per-NUMA metadata: allocation failure for node %d.\n", nid);
+	}
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define hq_queued_spin_lock_slowpath	pv_ops_lock.queued_spin_lock_slowpath
+#else
+void (*hq_queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val) =
+		native_queued_spin_lock_slowpath;
+EXPORT_SYMBOL(hq_queued_spin_lock_slowpath);
+#endif
+
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __hq_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+void __init hq_configure_spin_lock_slowpath(void)
+{
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+		    hq_queued_spin_lock_slowpath !=
+			native_queued_spin_lock_slowpath)) {
+		return;
+	}
+
+	numa_spinlock_flag = 1;
+	hq_queued_spin_lock_slowpath = __hq_queued_spin_lock_slowpath;
+	pr_info("Enabling HQspinlock\n");
+	hqlock_alloc_global_queues();
+}
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index af8d122bb6..7feab0046e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_HQ_SPINLOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -100,7 +100,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define pv_kick_node		__pv_kick_node
 #define pv_wait_head_or_lock	__pv_wait_head_or_lock
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_HQSPINLOCKS)
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
@@ -133,6 +133,11 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	u32 old, tail;
 	int idx;
 
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+	bool is_numa_lock = val & _Q_LOCKTYPE_MASK;
+	bool numa_awareness_on = is_numa_lock && !(val & _Q_LOCK_MODE_QSPINLOCK_VAL);
+#endif
+
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
 	if (pv_enabled())
@@ -147,16 +152,16 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
-	if (val == _Q_PENDING_VAL) {
+	if ((val & ~_Q_SERVICE_MASK) == _Q_PENDING_VAL) {
 		int cnt = _Q_PENDING_LOOPS;
 		val = atomic_cond_read_relaxed(&lock->val,
-					       (VAL != _Q_PENDING_VAL) || !cnt--);
+						((VAL & ~_Q_SERVICE_MASK) != _Q_PENDING_VAL) || !cnt--);
 	}
 
 	/*
 	 * If we observe any contention; queue.
 	 */
-	if (val & ~_Q_LOCKED_MASK)
+	if (val & ~(_Q_LOCKED_MASK | _Q_SERVICE_MASK))
 		goto queue;
 
 	/*
@@ -173,11 +178,15 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
 	 * on @next to become !NULL.
 	 */
-	if (unlikely(val & ~_Q_LOCKED_MASK)) {
-
+	if (unlikely(val & ~(_Q_LOCKED_MASK | _Q_SERVICE_MASK))) {
 		/* Undo PENDING if we set it. */
-		if (!(val & _Q_PENDING_MASK))
+		if (!(val & _Q_PENDING_VAL)) {
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+			hqlock_clear_pending(lock, val);
+#else
 			clear_pending(lock);
+#endif
+		}
 
 		goto queue;
 	}
@@ -194,14 +203,18 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * barriers.
 	 */
 	if (val & _Q_LOCKED_MASK)
-		smp_cond_load_acquire(&lock->locked, !VAL);
+		smp_cond_load_acquire(&lock->locked_pending, !(VAL & _Q_LOCKED_MASK));
 
 	/*
 	 * take ownership and clear the pending bit.
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+	hqlock_clear_pending_set_locked(lock, val);
+#else
 	clear_pending_set_locked(lock);
+#endif
 	lockevent_inc(lock_pending);
 	return;
 
@@ -274,7 +287,17 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * p,*,* -> n,*,*
 	 */
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+	if (is_numa_lock)
+		old = hqlock_xchg_tail(lock, tail, node, &numa_awareness_on);
+	else
+		old = xchg_tail(lock, tail);
+
+	if (numa_awareness_on && old == Q_NEW_NODE_QUEUE)
+		goto mcs_spin;
+#else
 	old = xchg_tail(lock, tail);
+#endif
 	next = NULL;
 
 	/*
@@ -288,6 +311,9 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+mcs_spin:
+#endif
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
@@ -349,6 +375,12 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       above wait condition, therefore any concurrent setting of
 	 *       PENDING will make the uncontended transition fail.
 	 */
+#if defined(_GEN_HQ_SPINLOCK_SLOWPATH) && !defined(_GEN_PV_LOCK_SLOWPATH)
+	if (is_numa_lock) {
+		hqlock_clear_tail_handoff(lock, val, tail, node, next, prev, numa_awareness_on);
+		goto release;
+	}
+#endif
 	if ((val & _Q_TAIL_MASK) == tail) {
 		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
 			goto release; /* No contention */
@@ -380,6 +412,21 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/* Generate the code for NUMA-aware qspinlock (HQspinlock) */
+#if !defined(_GEN_HQ_SPINLOCK_SLOWPATH) && defined(CONFIG_HQSPINLOCKS)
+#define _GEN_HQ_SPINLOCK_SLOWPATH
+
+#undef pv_init_node
+#define pv_init_node		hqlock_init_node
+
+#undef  queued_spin_lock_slowpath
+#include "hqlock_core.h"
+#define queued_spin_lock_slowpath	__hq_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h
index d69958a844..9933e0e666 100644
--- a/kernel/locking/qspinlock.h
+++ b/kernel/locking/qspinlock.h
@@ -39,7 +39,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_HQSPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -74,7 +74,7 @@ struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
 	return &((struct qnode *)base + idx)->mcs;
 }
 
-#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_VAL)
 
 #if _Q_PENDING_BITS == 8
 /**
diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
index 2338b3adfb..f31d13cbc5 100644
--- a/kernel/locking/spinlock_debug.c
+++ b/kernel/locking/spinlock_debug.c
@@ -30,6 +30,26 @@ void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name,
 	lock->owner_cpu = -1;
 }
 
+void __raw_spin_lock_init_hq(raw_spinlock_t *lock, const char *name,
+			  struct lock_class_key *key, short inner)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held lock:
+	 */
+	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+	lockdep_init_map_wait(&lock->dep_map, name, key, 0, inner);
+#endif
+#ifdef __ARCH_SPIN_LOCK_UNLOCKED_HQ
+	lock->raw_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED_HQ;
+#else
+	lock->raw_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
+#endif
+	lock->magic = SPINLOCK_MAGIC;
+	lock->owner = SPINLOCK_OWNER_INIT;
+	lock->owner_cpu = -1;
+}
+
 EXPORT_SYMBOL(__raw_spin_lock_init);
 
 #ifndef CONFIG_PREEMPT_RT
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cf92bd6a82..e2bc6646ce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3383,6 +3383,10 @@ void __init numa_policy_init(void)
 		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
+
+#ifdef CONFIG_HQSPINLOCKS
+	hq_configure_spin_lock_slowpath();
+#endif
 }
 
 /* Reset policy of current process to default */
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 7/7] futex: use hq-spinlock for hash buckets
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

Example of hq-spinlock enabled for futex hash-table bucket locks
(used in memcached testing scenario)

In the evaluated memcached workloads, this improved throughput by up to
10% on AMD EPYC 9654 and by up to 8% on Kunpeng 920, with corresponding
latency reductions.

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
---
 kernel/futex/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 31e83a0978..042cdc7f75 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1521,7 +1521,7 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 #endif
 	atomic_set(&fhb->waiters, 0);
 	plist_head_init(&fhb->chain);
-	spin_lock_init(&fhb->lock);
+	spin_lock_init_hq(&fhb->lock);
 }
 
 #define FH_CUSTOM	0x01
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 3/7] hq-spinlock: add contention detection
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

The hierarchical slowpath is needed for locks that experience
sustained cross-node contention. Enabling it unconditionally is
undesirable for lightly contended locks and may decrease the performance.

Add a simple contention detection scheme that tracks remote handoffs
separately from overall handoff activity and enables HQ mode only when
the observed handoff pattern indicates that cross-node contention is
high enough to benefit from NUMA-aware queueing.

HQ lock type can be turned on if remote_handoffs exceeds `hqlock_remote_handoffs_turn_numa`.
Lock can be turned back into QSPINLOCK mode if in HQ mode
amount of remote handoffs will not exceed `hqlock_remote_handoffs_keep_numa`.

Remote handoffs counter will increase only if the amount of local handoffs after previous increase
is not less than `hqlock_local_handoffs_to_increase_remotes`.

Additional locktorture reruns showed no degradation
in low-contention configurations after adding contention-based switching
while maintaining practically the same performance improvement in high contention cases.

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
---
 kernel/locking/hqlock_core.h  | 57 +++++++++++++++++++++++++++++++++--
 kernel/locking/hqlock_meta.h  |  4 +++
 kernel/locking/hqlock_types.h |  8 +++--
 3 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/hqlock_core.h b/kernel/locking/hqlock_core.h
index 7322199228..e2ba09d758 100644
--- a/kernel/locking/hqlock_core.h
+++ b/kernel/locking/hqlock_core.h
@@ -450,6 +450,23 @@ static inline void hqlock_handoff(struct qspinlock *lock,
 					 struct mcs_spinlock *next, u32 tail,
 					 int handoff_info);
 
+/*
+ * In low_contention_mcs_lock_handoff we wanted to help processor optimise writes
+ * and avoid extra reading of our cpu cacheline (read our qnode->numa_node),
+ * so previous contender has saved his numa node in our prev_numa_node,
+ * and now we need to update remote_handoffs counter by ourself
+ */
+static __always_inline void update_counters_qspinlock(struct numa_qnode *qnode)
+{
+	if (qnode->numa_node != qnode->prev_numa_node) {
+		if ((qnode->general_handoffs - qnode->prev_general_handoffs)
+		    > hqlock_local_handoffs_to_increase_remotes) {
+			qnode->remote_handoffs++;
+		}
+
+		qnode->prev_general_handoffs = qnode->general_handoffs;
+	}
+}
 
 /*
  * Chech if contention has risen and if we need to set NUMA-aware mode
@@ -458,8 +475,13 @@ static __always_inline bool determine_contention_qspinlock_mode(struct mcs_spinl
 {
 	struct numa_qnode *qnode = (void *)node;
 
-	if (qnode->general_handoffs > READ_ONCE(hqlock_general_handoffs_turn_numa))
+	unsigned long general_handoffs = (unsigned long) qnode->general_handoffs;
+	unsigned long remote_handoffs = (unsigned long) qnode->remote_handoffs;
+
+	if ((general_handoffs > hqlock_general_handoffs_turn_numa) &&
+		(remote_handoffs > hqlock_remote_handoffs_turn_numa))
 		return true;
+
 	return false;
 }
 
@@ -485,7 +507,14 @@ static __always_inline bool low_contention_try_clear_tail(struct qspinlock *lock
 	else
 		update_val |= _Q_LOCK_INVALID_TAIL;
 
-	return atomic_try_cmpxchg_relaxed(&lock->val, &val, update_val);
+	bool ret = atomic_try_cmpxchg_relaxed(&lock->val, &val, update_val);
+
+#ifdef CONFIG_HQSPINLOCKS_DEBUG
+	if (ret && high_contention)
+		atomic_inc(&transitions_from_qspinlock_to_hq);
+#endif
+
+	return ret;
 }
 
 static __always_inline void low_contention_mcs_lock_handoff(struct mcs_spinlock *node,
@@ -502,6 +531,17 @@ static __always_inline void low_contention_mcs_lock_handoff(struct mcs_spinlock
 		general_handoffs++;
 
 	qnext->general_handoffs = general_handoffs;
+	qnext->remote_handoffs = qnode->remote_handoffs;
+	qnext->prev_general_handoffs = qnode->prev_general_handoffs;
+
+	/*
+	 * Show next contender our numa node and assume
+	 * he will update remote_handoffs counter in update_counters_qspinlock by himself
+	 * instead of reading his numa_node and updating remote_handoffs here
+	 * to avoid extra cacheline transferring and help processor optimise several writes here
+	 */
+	qnext->prev_numa_node = qnode->numa_node;
+
 	arch_mcs_spin_unlock_contended(&next->locked);
 }
 
@@ -557,6 +597,10 @@ static inline void hqlock_init_node(struct mcs_spinlock *node)
 	qnode->numa_node = numa_node_id() + 1;
 	qnode->lock_id = 0;
 	qnode->wrong_fallback_tail = 0;
+
+	qnode->remote_handoffs = 0;
+	qnode->prev_numa_node = 0;
+	qnode->prev_general_handoffs = 0;
 }
 
 static inline void reset_handoff_counter(struct numa_qnode *qnode)
@@ -580,6 +624,8 @@ static inline void handoff_local(struct mcs_spinlock *node,
 
 	qnext->general_handoffs = general_handoffs;
 
+	qnext->remote_handoffs = qnode->remote_handoffs;
+
 	u16 wrong_fallback_tail = qnode->wrong_fallback_tail;
 
 	if (wrong_fallback_tail != 0 && wrong_fallback_tail != (tail >> _Q_TAIL_OFFSET)) {
@@ -641,6 +687,13 @@ static inline void handoff_remote(struct qspinlock *lock,
 
 	mcs_head = (void *) qhead;
 
+	u16 remote_handoffs = qnode->remote_handoffs;
+
+	if (qnode->general_handoffs > hqlock_local_handoffs_to_increase_remotes)
+		remote_handoffs++;
+
+	qhead->remote_handoffs = remote_handoffs;
+
 	/* arch_mcs_spin_unlock_contended implies smp-barrier */
 	arch_mcs_spin_unlock_contended(&mcs_head->locked);
 }
diff --git a/kernel/locking/hqlock_meta.h b/kernel/locking/hqlock_meta.h
index 5b54801326..561d5a5fd0 100644
--- a/kernel/locking/hqlock_meta.h
+++ b/kernel/locking/hqlock_meta.h
@@ -307,6 +307,10 @@ static inline void release_lock_meta(struct qspinlock *lock,
 			goto do_rollback;
 	}
 
+	if (qnode->remote_handoffs < hqlock_remote_handoffs_keep_numa) {
+		upd_val |= _Q_LOCK_MODE_QSPINLOCK_VAL;
+	}
+
 	/*
 	 * We need wait until pending is gone.
 	 * Otherwise, clearing pending can erase a mode we will set here
diff --git a/kernel/locking/hqlock_types.h b/kernel/locking/hqlock_types.h
index 32d06f2755..40061f11a1 100644
--- a/kernel/locking/hqlock_types.h
+++ b/kernel/locking/hqlock_types.h
@@ -37,9 +37,13 @@ struct numa_qnode {
 
 	u16 lock_id;
 	u16 wrong_fallback_tail;
-	u16 general_handoffs;
-
 	u16 numa_node;
+
+
+	u16 general_handoffs;
+	u16 remote_handoffs;
+	u16 prev_general_handoffs;
+	u16 prev_numa_node;
 };
 
 struct numa_queue {
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 6/7] lockref: use hq-spinlock
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

Example of hq-spinlock enabled for dentry->lockref spinlock
(used in nginx testing scenario)

In the evaluated nginx single-file workload on Kunpeng 920, throughput
gains reached 68-78% at 64-96 workers.

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
---
 include/linux/lockref.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/lockref.h b/include/linux/lockref.h
index 6ded24cdb4..19a1c3823c 100644
--- a/include/linux/lockref.h
+++ b/include/linux/lockref.h
@@ -42,7 +42,7 @@ struct lockref {
  */
 static inline void lockref_init(struct lockref *lockref)
 {
-	spin_lock_init(&lockref->lock);
+	spin_lock_init_hq(&lockref->lock);
 	lockref->count = 1;
 }
 
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 1/7] kernel: add hq-spinlock types
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3
In-Reply-To: <20260415164459.2904963-1-fedorov.nikita@h-partners.com>

Introduce the type definitions used by Hierarchical Queued spinlocks.

Add the HQ-specific types definitions and extend qspinlock so
that a lock can be marked as using the HQ slowpath
while preserving the existing qspinlock layout

Co-developed-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Co-developed-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
Signed-off-by: Nikita Fedorov <fedorov.nikita@h-partners.com>
---
 include/asm-generic/qspinlock_types.h |  44 ++++++++--
 kernel/locking/hqlock_types.h         | 118 ++++++++++++++++++++++++++
 2 files changed, 157 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/hqlock_types.h

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 2fd1fb89ec..a97387ae48 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -43,11 +43,6 @@ typedef struct qspinlock {
 	};
 } arch_spinlock_t;
 
-/*
- * Initializier
- */
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
-
 /*
  * Bitfields in the atomic value:
  *
@@ -76,6 +71,26 @@ typedef struct qspinlock {
 #else
 #define _Q_PENDING_BITS		1
 #endif
+
+#ifdef CONFIG_HQSPINLOCKS
+/* For locks with HQ-mode we always use single pending bit */
+#define _Q_PENDING_HQLOCK_BITS	 1
+#define _Q_PENDING_HQLOCK_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_HQLOCK_MASK		_Q_SET_MASK(PENDING_HQLOCK)
+
+#define _Q_LOCKTYPE_OFFSET	(_Q_PENDING_HQLOCK_OFFSET + _Q_PENDING_HQLOCK_BITS)
+#define _Q_LOCKTYPE_BITS	1
+#define _Q_LOCKTYPE_MASK	_Q_SET_MASK(LOCKTYPE)
+#define _Q_LOCKTYPE_HQ			(1U << _Q_LOCKTYPE_OFFSET)
+
+#define _Q_LOCK_MODE_OFFSET	(_Q_LOCKTYPE_OFFSET + _Q_LOCKTYPE_BITS)
+#define _Q_LOCK_MODE_BITS	2
+#define _Q_LOCK_MODE_MASK	_Q_SET_MASK(LOCK_MODE)
+#define _Q_LOCK_MODE_QSPINLOCK_VAL	(1U << _Q_LOCK_MODE_OFFSET)
+
+#define _Q_LOCK_TYPE_MODE_MASK	(_Q_LOCKTYPE_MASK | _Q_LOCK_MODE_MASK)
+#endif //CONFIG_HQSPINLOCKS
+
 #define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -92,4 +107,23 @@ typedef struct qspinlock {
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
+#ifdef CONFIG_HQSPINLOCKS
+#define _Q_LOCK_INVALID_TAIL	(_Q_TAIL_IDX_MASK)
+
+#define _Q_SERVICE_MASK		(_Q_LOCKTYPE_MASK | _Q_LOCK_MODE_QSPINLOCK_VAL | _Q_LOCK_INVALID_TAIL)
+#else // CONFIG_HQSPINLOCKS
+#define _Q_SERVICE_MASK		0
+#endif
+
+/*
+ * Initializier
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED		{ { .val = ATOMIC_INIT(0) } }
+
+#ifdef CONFIG_HQSPINLOCKS
+#define	__ARCH_SPIN_LOCK_UNLOCKED_HQ	{ { .val = ATOMIC_INIT(_Q_LOCKTYPE_HQ | _Q_LOCK_MODE_QSPINLOCK_VAL) } }
+#else
+#define	__ARCH_SPIN_LOCK_UNLOCKED_HQ	{ { .val = ATOMIC_INIT(0) } }
+#endif
+
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/hqlock_types.h b/kernel/locking/hqlock_types.h
new file mode 100644
index 0000000000..32d06f2755
--- /dev/null
+++ b/kernel/locking/hqlock_types.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+#define IRQ_NODE (MAX_NUMNODES + 1)
+#define Q_NEW_NODE_QUEUE 1
+#define LOCK_ID_BITS		(12)
+
+#define LOCK_ID_MAX			(1 << LOCK_ID_BITS)
+#define LOCK_ID_NONE		(LOCK_ID_MAX + 1)
+
+#define _QUEUE_TAIL_MASK (((1ULL << 32) - 1) << 0)
+#define _QUEUE_SEQ_COUNTER_MASK (((1ULL << 32) - 1) << 32)
+
+/*
+ * Output code for handoff-logic:
+ *
+ * ==  0 (HQLOCK_HANDOFF_LOCAL) - has local nodes to handoff
+ *  >  0 - has remote node to handoff, id is visible
+ * == -1 (HQLOCK_HANDOFF_REMOTE_HEAD) - has remote node to handoff,
+ *         id isn't visible yet, will be in the *lock_meta->head_node*
+ */
+enum {
+	HQLOCK_HANDOFF_LOCAL = 0,
+	HQLOCK_HANDOFF_REMOTE_HEAD = -1
+};
+
+typedef enum {
+	LOCK_NO_MODE = 0,
+	LOCK_MODE_QSPINLOCK = 1,
+	LOCK_MODE_HQLOCK = 2,
+} hqlock_mode_t;
+
+struct numa_qnode {
+	struct mcs_spinlock mcs;
+
+	u16 lock_id;
+	u16 wrong_fallback_tail;
+	u16 general_handoffs;
+
+	u16 numa_node;
+};
+
+struct numa_queue {
+	struct numa_qnode *head;
+	union {
+		u64 seq_counter_tail;
+		struct {
+			u32 tail;
+			u32 seq_counter;
+		};
+	};
+
+	u16 next_node;
+	u16 prev_node;
+
+	u16 handoffs_not_head;
+} ____cacheline_aligned;
+
+/**
+ * Lock metadata
+ * "allocated"/"freed" on demand.
+ *
+ * Used to dynamically bind numa_queue to a lock,
+ * maintain FIFO-order for the NUMA-queues.
+ *
+ * seq_counter is needed to distinguish metadata usage
+ * by different locks, preventing local contenders
+ * from queueing in the wrong per-NUMA queue
+ *
+ * @see set_bucket
+ * @see numa_xchg_tail
+ */
+struct lock_metadata {
+	atomic_t seq_counter;
+	struct qspinlock *lock_ptr;
+
+	/* NUMA-queues of contenders ae kept in FIFO order */
+	union {
+		u32 nodes_tail;
+		struct {
+			u16 tail_node;
+			u16 head_node;
+		};
+	};
+};
+
+static inline int decode_lock_mode(u32 lock_val)
+{
+	return (lock_val & _Q_LOCK_MODE_MASK) >> _Q_LOCK_MODE_OFFSET;
+}
+
+static inline u32 encode_lock_mode(u16 lock_id)
+{
+	if (lock_id == LOCK_ID_NONE)
+		return LOCK_MODE_QSPINLOCK << _Q_LOCK_MODE_OFFSET;
+
+	return LOCK_MODE_HQLOCK << _Q_LOCK_MODE_OFFSET;
+}
+
+static inline u64 encode_tc(u32 tail, u32 counter)
+{
+	u64 __tail = (u64)tail;
+	u64 __counter = (u64)counter;
+
+	return __tail | (__counter << 32);
+}
+
+static inline u32 decode_tc_tail(u64 tail_counter)
+{
+	return (u32)(tail_counter & _QUEUE_TAIL_MASK);
+}
+
+static inline u32 decode_tc_counter(u64 tail_counter)
+{
+	return (u32)((tail_counter & _QUEUE_SEQ_COUNTER_MASK) >> 32);
+}
-- 
2.34.1



^ permalink raw reply related

* [RFC PATCH v3 0/7]
From: Fedorov Nikita @ 2026-04-15 16:44 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, hpa, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, bcm-kernel-feedback-list, Arnd Bergmann,
	Peter Zijlstra, Boqun Feng, Waiman Long, Darren Hart,
	Davidlohr Bueso, andrealmeid, Andrew Morton, David Hildenbrand,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, byungchul,
	Gregory Price, Ying Huang, Alistair Popple, Anatoly Stepanov
  Cc: Nikita Fedorov, linux-arm-kernel, linux-kernel, virtualization,
	linux-arch, linux-mm, guohanjun, wangkefeng.wang, weiyongjun1,
	yusongping, leijitang, artem.kuzin, kang.sun, chenjieping3

Changes since v2:
  - split the patch into a smaller patch series,
  though it is still hard to split the hqlock internal logic itself
  - remove some unused code
  - rebased onto Linux v7.0
  - allocate hq-spinlock metadata with kvmalloc instead of memblock
  - added contention detection and verified we have no performance degradation in low contention scenario

[Motivation]

In a high contention case, existing Linux kernel spinlock implementations can become
inefficient on modern NUMA systems due to frequent and expensive
cross-NUMA cacheline transfers. 

This might happen due to following reasons:
 - on "contender enqueue" each lock contender updates a shared lock structure
 - on "MCS handoff" cross-NUMA cache-line transfer occurs when
two contenders are from different NUMA nodes.

Previous work regarding NUMA-aware spinlock in Linux kernel is CNA-lock:
https://lore.kernel.org/lkml/20210514200743.3026725-1-alex.kogan@oracle.com/

It reduces cross-NUMA cacheline traffic during handoff, but does not reduce it during enqueuing.
CNA design also requires the first contender to do additional work during global spinning
and keeps threads from all nodes other than the first one in the single secondary queue. 
In our measurements, we only saw benefits from using it on Kunpeng;
on x86 platforms, CNA behaved the same as a regular qspinlock.
Thus, there is still quite a lot of potential for optimization.

HQ-lock has completely different design concept: kind of cohort-lock and
queued-spinlock hybrid.

If someone wants to try the HQ-lock in some subsystem, just
change lock initialization code from `spin_lock_init()` to `spin_lock_init_hq()`,
or change `DEFINE_SPINLOCK()` macro to `DEFINE_SPINLOCK_HQ()` if the lock is static.
The dedicated bit in the lock structure is used to distiguish between the two lock types.

[Performance measurements]

Performance measurements were done on x86 (AMD EPYC) and arm64 (Kunpeng 920)
platforms with the following scenarious:
- Locktorture benchmark
- Memcached + memtier benchmark
- Ngnix + Wrk benchmark

[Locktorture]

NPS stands for "Nodes per socket"
+------------------------------+-----------------------+-------+-------+--------+
| AMD EPYC 9654									|
+------------------------------+-----------------------+-------+-------+--------+
| 192 cores (x2 hyper-threads) |                       |       |       |        |
| 2 sockets                    |                       |       |       |        |
| Locktorture 60 sec.          | NUMA nodes per-socket |       |       |        |
| Average gain (single lock)   | 1 NPS                 | 2 NPS | 4 NPS | 12 NPS |
| Total contender threads      |                       |       |       |        |
| 8                            | 19%                   | 21%   | 12%   | 12%    |
| 16                           | 13%                   | 18%   | 34%   | 75%    |
| 32                           | 8%                    | 14%   | 25%   | 112%   |
| 64                           | 11%                   | 12%   | 30%   | 152%   |
| 128                          | 9%                    | 17%   | 37%   | 163%   |
| 256                          | 2%                    | 16%   | 40%   | 168%   |
| 384                          | -1%                   | 14%   | 44%   | 186%   |
+------------------------------+-----------------------+-------+-------+--------+

+-----------------+-------+-------+-------+--------+
| Fairness factor | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
+-----------------+-------+-------+-------+--------+
| 8               | 0.54  | 0.57  | 0.57  | 0.55   |
| 16              | 0.52  | 0.53  | 0.60  | 0.58   |
| 32              | 0.53  | 0.53  | 0.53  | 0.61   |
| 64              | 0.52  | 0.56  | 0.54  | 0.56   |
| 128             | 0.51  | 0.54  | 0.54  | 0.53   |
| 256             | 0.52  | 0.52  | 0.52  | 0.52   |
| 384             | 0.51  | 0.51  | 0.51  | 0.51   |
+-----------------+-------+-------+-------+--------+

+-------------------------+--------------+
| Kunpeng 920 (arm64)     |              |
+-------------------------+--------------+
| 96 cores (no MT)        |              |
| 2 sockets, 4 NUMA nodes |              |
| Locktorture 60 sec.     |              |
|                         |              |
| Total contender threads | Average gain |
| 8                       | 93%          |
| 16                      | 142%         |
| 32                      | 129%         |
| 64                      | 152%         |
| 96                      | 158%         |
+-------------------------+--------------+

[Memcached]

+---------------------------------+-----------------+-------------------+
| AMD EPYC 9654                   |                 |                   |
+---------------------------------+-----------------+-------------------+
| 192 cores (x2 hyper-threads)    |                 |                   |
| 2 sockets, NPS=4                |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency reduction |
| 32                              | 1%              | -1%               |
| 64                              | 1%              | -1%               |
| 128                             | 3%              | -4%               |
| 256                             | 7%              | -6%               |
| 384                             | 10%             | -8%               |
+---------------------------------+-----------------+-------------------+

+---------------------------------+-----------------+-------------------+
| Kunpeng 920 (arm64)             |                 |                   |
+---------------------------------+-----------------+-------------------+
| 96 cores (no MT)                |                 |                   |
| 2 sockets, 4 NUMA nodes         |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency reduction |
| 32                              | 4%              | -3%               |
| 64                              | 6%              | -6%               |
| 80                              | 8%              | -7%               |
| 96                              | 8%              | -8%               |
+---------------------------------+-----------------+-------------------+

[Nginx]

+-----------------------------------------------------------------------+-----------------+
| Kunpeng 920 (arm64)                                                   |                 |
+-----------------------------------------------------------------------+-----------------+
| 96 cores (no MT)                                                      |                 |
| 2 sockets, 4 NUMA nodes                                               |                 |
|                                                                       |                 |
| Nginx + WRK benchmark, single file (lockref spinlock contention case) |                 |
| Workers                                                               | Throughput gain |
| 32                                                                    | 1%              |
| 64                                                                    | 68%             |
| 80                                                                    | 72%             |
| 96                                                                    | 78%             |
+-----------------------------------------------------------------------+-----------------+
Despite, the test is a single-file test, it can be related to real-life cases, when some
html-pages are accessed much more frequently than others (index.html, etc.)

[Low contention remarks]
After adding contention detection scheme, we do not see performance degradation in low contention scenario (< 8 threads),
throughput of HQspinlock is equal to qspinlock,
while still having practically the same improvement in high contention case as mentioned above.

Previous version:
https://lore.kernel.org/lkml/20251206062106.2109014-1-stepanov.anatoly@huawei.com/

Anatoly Stepanov (7):
  kernel: add hq-spinlock types
  hq-spinlock: implement inner logic
  hq-spinlock: add contention detection
  hq-spinlock: add hq-spinlock tunables and debug statistics
  kernel: introduce general hq-spinlock support
  lockref: use hq-spinlock
  futex: use hq-spinlock for hash buckets

 arch/arm64/include/asm/qspinlock.h       |  37 +
 arch/x86/include/asm/hq-spinlock.h       |  34 +
 arch/x86/include/asm/paravirt-spinlock.h |   3 +-
 arch/x86/include/asm/qspinlock.h         |   6 +-
 include/asm-generic/qspinlock.h          |  23 +-
 include/asm-generic/qspinlock_types.h    |  44 +-
 include/linux/lockref.h                  |   2 +-
 include/linux/spinlock.h                 |  26 +
 include/linux/spinlock_types.h           |  26 +
 include/linux/spinlock_types_raw.h       |  20 +
 kernel/Kconfig.locks                     |  29 +
 kernel/futex/core.c                      |   2 +-
 kernel/locking/hqlock_core.h             | 850 +++++++++++++++++++++++
 kernel/locking/hqlock_meta.h             | 487 +++++++++++++
 kernel/locking/hqlock_proc.h             | 164 +++++
 kernel/locking/hqlock_types.h            | 122 ++++
 kernel/locking/qspinlock.c               |  65 +-
 kernel/locking/qspinlock.h               |   4 +-
 kernel/locking/spinlock_debug.c          |  20 +
 mm/mempolicy.c                           |   4 +
 20 files changed, 1939 insertions(+), 29 deletions(-)
 create mode 100644 arch/arm64/include/asm/qspinlock.h
 create mode 100644 arch/x86/include/asm/hq-spinlock.h
 create mode 100644 kernel/locking/hqlock_core.h
 create mode 100644 kernel/locking/hqlock_meta.h
 create mode 100644 kernel/locking/hqlock_proc.h
 create mode 100644 kernel/locking/hqlock_types.h

-- 
2.34.1



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox