[PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
@ 2025-01-02 18:32 Rob Clark
  2025-01-02 19:30 ` Akhil P Oommen
  2025-01-07 12:57 ` Will Deacon
  0 siblings, 2 replies; 8+ messages in thread
From: Rob Clark @ 2025-01-02 18:32 UTC (permalink / raw)
  To: iommu
  Cc: linux-arm-msm, freedreno, Robin Murphy, Will Deacon, Rob Clark,
	Rob Clark, Joerg Roedel, moderated list:ARM SMMU DRIVERS,
	open list

From: Rob Clark <robdclark@chromium.org>

On mmu-500, stall-on-fault seems to stall all context banks, causing the
GMU to misbehave.  So limit this feature to smmu-v2 for now.

This fixes an issue with an older mesa bug taking outo the system
because of GMU going off into the weeds.

What we _think_ is happening is that, if the GPU generates 1000's of
faults at ~once (which is something that GPUs can be good at), it can
result in a sufficient number of stalled translations preventing other
transactions from entering the same TBU.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
v2: Adds a modparam to override the default behavior, for debugging
    GPU faults in cases which do not (or might not) cause lockup.
    Also, rebased to not depend on Bibek's PRR support.

 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 6372f3e25c4b..3239bbf18514 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -16,6 +16,10 @@
 
 #define QCOM_DUMMY_VAL	-1
 
+static int enable_stall = -1;
+MODULE_PARM_DESC(enable_stall, "Enable stall on iova fault (1=on , 0=disable, -1=auto (default))");
+module_param(enable_stall, int, 0600);
+
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 {
 	return container_of(smmu, struct qcom_smmu, smmu);
@@ -210,7 +214,9 @@ static bool qcom_adreno_can_do_ttbr1(struct arm_smmu_device *smmu)
 static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 		struct io_pgtable_cfg *pgtbl_cfg, struct device *dev)
 {
+	const struct device_node *np = smmu_domain->smmu->dev->of_node;
 	struct adreno_smmu_priv *priv;
+	bool stall_enabled;
 
 	smmu_domain->cfg.flush_walk_prefer_tlbiasid = true;
 
@@ -237,8 +243,17 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
-	priv->set_stall = qcom_adreno_smmu_set_stall;
-	priv->resume_translation = qcom_adreno_smmu_resume_translation;
+
+	if (enable_stall < 0) {
+		stall_enabled = of_device_is_compatible(np, "qcom,smmu-v2");
+	} else {
+		stall_enabled = !!enable_stall;
+	}
+
+	if (stall_enabled) {
+		priv->set_stall = qcom_adreno_smmu_set_stall;
+		priv->resume_translation = qcom_adreno_smmu_resume_translation;
+	}
 
 	return 0;
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-02 18:32 [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2 Rob Clark
@ 2025-01-02 19:30 ` Akhil P Oommen
  2025-01-06 20:10   ` Akhil P Oommen
  2025-01-07 12:57 ` Will Deacon
  1 sibling, 1 reply; 8+ messages in thread
From: Akhil P Oommen @ 2025-01-02 19:30 UTC (permalink / raw)
  To: Rob Clark, iommu
  Cc: linux-arm-msm, freedreno, Robin Murphy, Will Deacon, Rob Clark,
	Joerg Roedel, moderated list:ARM SMMU DRIVERS, open list

On 1/3/2025 12:02 AM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> 
> This fixes an issue with an older mesa bug taking outo the system
> because of GMU going off into the weeds.
> 
> What we _think_ is happening is that, if the GPU generates 1000's of
> faults at ~once (which is something that GPUs can be good at), it can
> result in a sufficient number of stalled translations preventing other
> transactions from entering the same TBU.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>

Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>

-Akhil

> ---
> v2: Adds a modparam to override the default behavior, for debugging
>     GPU faults in cases which do not (or might not) cause lockup.
>     Also, rebased to not depend on Bibek's PRR support.
> 
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 6372f3e25c4b..3239bbf18514 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -16,6 +16,10 @@
>  
>  #define QCOM_DUMMY_VAL	-1
>  
> +static int enable_stall = -1;
> +MODULE_PARM_DESC(enable_stall, "Enable stall on iova fault (1=on , 0=disable, -1=auto (default))");
> +module_param(enable_stall, int, 0600);
> +
>  static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
>  {
>  	return container_of(smmu, struct qcom_smmu, smmu);
> @@ -210,7 +214,9 @@ static bool qcom_adreno_can_do_ttbr1(struct arm_smmu_device *smmu)
>  static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  		struct io_pgtable_cfg *pgtbl_cfg, struct device *dev)
>  {
> +	const struct device_node *np = smmu_domain->smmu->dev->of_node;
>  	struct adreno_smmu_priv *priv;
> +	bool stall_enabled;
>  
>  	smmu_domain->cfg.flush_walk_prefer_tlbiasid = true;
>  
> @@ -237,8 +243,17 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
>  	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
>  	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> -	priv->set_stall = qcom_adreno_smmu_set_stall;
> -	priv->resume_translation = qcom_adreno_smmu_resume_translation;
> +
> +	if (enable_stall < 0) {
> +		stall_enabled = of_device_is_compatible(np, "qcom,smmu-v2");
> +	} else {
> +		stall_enabled = !!enable_stall;
> +	}
> +
> +	if (stall_enabled) {
> +		priv->set_stall = qcom_adreno_smmu_set_stall;
> +		priv->resume_translation = qcom_adreno_smmu_resume_translation;
> +	}
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-02 19:30 ` Akhil P Oommen
@ 2025-01-06 20:10   ` Akhil P Oommen
  2025-01-06 20:59     ` Rob Clark
  2025-01-06 21:00     ` Rob Clark
  0 siblings, 2 replies; 8+ messages in thread
From: Akhil P Oommen @ 2025-01-06 20:10 UTC (permalink / raw)
  To: Rob Clark, iommu
  Cc: linux-arm-msm, freedreno, Robin Murphy, Will Deacon, Rob Clark,
	Joerg Roedel, moderated list:ARM SMMU DRIVERS, open list

On 1/3/2025 1:00 AM, Akhil P Oommen wrote:
> On 1/3/2025 12:02 AM, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> On mmu-500, stall-on-fault seems to stall all context banks, causing the
>> GMU to misbehave.  So limit this feature to smmu-v2 for now.
>>
>> This fixes an issue with an older mesa bug taking outo the system
>> because of GMU going off into the weeds.
>>
>> What we _think_ is happening is that, if the GPU generates 1000's of
>> faults at ~once (which is something that GPUs can be good at), it can
>> result in a sufficient number of stalled translations preventing other
>> transactions from entering the same TBU.
>>
>> Signed-off-by: Rob Clark <robdclark@chromium.org>
> 
> Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> 

Btw, if stall is not enabled, I think there is no point in capturing
coredump from adreno pagefault handler. By the time we start coredump,
gpu might have switched context.

-Akhil.

> -Akhil
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-06 20:10   ` Akhil P Oommen
@ 2025-01-06 20:59     ` Rob Clark
  2025-01-06 21:00     ` Rob Clark
  1 sibling, 0 replies; 8+ messages in thread
From: Rob Clark @ 2025-01-06 20:59 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: iommu, linux-arm-msm, freedreno, Robin Murphy, Will Deacon,
	Rob Clark, Joerg Roedel, moderated list:ARM SMMU DRIVERS,
	open list

On Mon, Jan 6, 2025 at 12:11 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 1/3/2025 1:00 AM, Akhil P Oommen wrote:
> > On 1/3/2025 12:02 AM, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> >> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >>
> >> This fixes an issue with an older mesa bug taking outo the system
> >> because of GMU going off into the weeds.
> >>
> >> What we _think_ is happening is that, if the GPU generates 1000's of
> >> faults at ~once (which is something that GPUs can be good at), it can
> >> result in a sufficient number of stalled translations preventing other
> >> transactions from entering the same TBU.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >
> > Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >
>
> Btw, if stall is not enabled, I think there is no point in capturing
> coredump from adreno pagefault handler. By the time we start coredump,
> gpu might have switched context.
>
> -Akhil.
>
> > -Akhil
> >

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-06 20:10   ` Akhil P Oommen
  2025-01-06 20:59     ` Rob Clark
@ 2025-01-06 21:00     ` Rob Clark
  1 sibling, 0 replies; 8+ messages in thread
From: Rob Clark @ 2025-01-06 21:00 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: iommu, linux-arm-msm, freedreno, Robin Murphy, Will Deacon,
	Rob Clark, Joerg Roedel, moderated list:ARM SMMU DRIVERS,
	open list

On Mon, Jan 6, 2025 at 12:11 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 1/3/2025 1:00 AM, Akhil P Oommen wrote:
> > On 1/3/2025 12:02 AM, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> >> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >>
> >> This fixes an issue with an older mesa bug taking outo the system
> >> because of GMU going off into the weeds.
> >>
> >> What we _think_ is happening is that, if the GPU generates 1000's of
> >> faults at ~once (which is something that GPUs can be good at), it can
> >> result in a sufficient number of stalled translations preventing other
> >> transactions from entering the same TBU.
> >>
> >> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >
> > Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >
>
> Btw, if stall is not enabled, I think there is no point in capturing
> coredump from adreno pagefault handler. By the time we start coredump,
> gpu might have switched context.

Hmm, we do at least capture ttbr0 both in fault info and from the
current submit, so it would at least be possible to tell if you are
looking at the wrong context.

BR,
-R

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-02 18:32 [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2 Rob Clark
  2025-01-02 19:30 ` Akhil P Oommen
@ 2025-01-07 12:57 ` Will Deacon
  2025-01-07 15:26   ` Rob Clark
  1 sibling, 1 reply; 8+ messages in thread
From: Will Deacon @ 2025-01-07 12:57 UTC (permalink / raw)
  To: Rob Clark
  Cc: iommu, linux-arm-msm, freedreno, Robin Murphy, Rob Clark,
	Joerg Roedel, moderated list:ARM SMMU DRIVERS, open list

On Thu, Jan 02, 2025 at 10:32:31AM -0800, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> 
> This fixes an issue with an older mesa bug taking outo the system
> because of GMU going off into the weeds.
> 
> What we _think_ is happening is that, if the GPU generates 1000's of
> faults at ~once (which is something that GPUs can be good at), it can
> result in a sufficient number of stalled translations preventing other
> transactions from entering the same TBU.

MMU-500 is an implementation of the SMMUv2 architecture, so this feels
upside-down to me. That is, it should always be valid to probe with
the less specific "SMMUv2" compatible string (modulo hardware errata)
and be limited to the architectural behaviour.

So what is about MMU-500 that means stalling doesn't work when compared
to any other SMMUv2 implementation?

Will

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-07 12:57 ` Will Deacon
@ 2025-01-07 15:26   ` Rob Clark
  2025-01-07 22:43     ` Dmitry Baryshkov
  0 siblings, 1 reply; 8+ messages in thread
From: Rob Clark @ 2025-01-07 15:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu, linux-arm-msm, freedreno, Robin Murphy, Rob Clark,
	Joerg Roedel, moderated list:ARM SMMU DRIVERS, open list

On Tue, Jan 7, 2025 at 4:57 AM Will Deacon <will@kernel.org> wrote:
>
> On Thu, Jan 02, 2025 at 10:32:31AM -0800, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > On mmu-500, stall-on-fault seems to stall all context banks, causing the
> > GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >
> > This fixes an issue with an older mesa bug taking outo the system
> > because of GMU going off into the weeds.
> >
> > What we _think_ is happening is that, if the GPU generates 1000's of
> > faults at ~once (which is something that GPUs can be good at), it can
> > result in a sufficient number of stalled translations preventing other
> > transactions from entering the same TBU.
>
> MMU-500 is an implementation of the SMMUv2 architecture, so this feels
> upside-down to me. That is, it should always be valid to probe with
> the less specific "SMMUv2" compatible string (modulo hardware errata)
> and be limited to the architectural behaviour.

I should have been more specific and referred to qcom,smmu-v2

> So what is about MMU-500 that means stalling doesn't work when compared
> to any other SMMUv2 implementation?

Well, I have a limited # of data points, in the sense that there
aren't too many a6xx devices prior to the switch to qcom,smmu-500..
but I have access to crash metrics for a lot of sc7180 devices
(qcom,smmu-v2), and I've been unable to find any signs of this sort of
stall related issue.

So maybe I can't 100% say this is qcom,smmu-500 vs qcom,smmu-v2, vs
some other change in later gens that used qcom,smmu-500 or some other
factor, I'm not sure what other conclusion to draw.

BR,
-R

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2
  2025-01-07 15:26   ` Rob Clark
@ 2025-01-07 22:43     ` Dmitry Baryshkov
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Baryshkov @ 2025-01-07 22:43 UTC (permalink / raw)
  To: Rob Clark
  Cc: Will Deacon, iommu, linux-arm-msm, freedreno, Robin Murphy,
	Rob Clark, Joerg Roedel, moderated list:ARM SMMU DRIVERS,
	open list

On Tue, Jan 07, 2025 at 07:26:44AM -0800, Rob Clark wrote:
> On Tue, Jan 7, 2025 at 4:57 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Thu, Jan 02, 2025 at 10:32:31AM -0800, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > On mmu-500, stall-on-fault seems to stall all context banks, causing the
> > > GMU to misbehave.  So limit this feature to smmu-v2 for now.
> > >
> > > This fixes an issue with an older mesa bug taking outo the system
> > > because of GMU going off into the weeds.
> > >
> > > What we _think_ is happening is that, if the GPU generates 1000's of
> > > faults at ~once (which is something that GPUs can be good at), it can
> > > result in a sufficient number of stalled translations preventing other
> > > transactions from entering the same TBU.
> >
> > MMU-500 is an implementation of the SMMUv2 architecture, so this feels
> > upside-down to me. That is, it should always be valid to probe with
> > the less specific "SMMUv2" compatible string (modulo hardware errata)
> > and be limited to the architectural behaviour.
> 
> I should have been more specific and referred to qcom,smmu-v2
> 
> > So what is about MMU-500 that means stalling doesn't work when compared
> > to any other SMMUv2 implementation?
> 
> Well, I have a limited # of data points, in the sense that there
> aren't too many a6xx devices prior to the switch to qcom,smmu-500..
> but I have access to crash metrics for a lot of sc7180 devices
> (qcom,smmu-v2), and I've been unable to find any signs of this sort of
> stall related issue.
> 
> So maybe I can't 100% say this is qcom,smmu-500 vs qcom,smmu-v2, vs
> some other change in later gens that used qcom,smmu-500 or some other
> factor, I'm not sure what other conclusion to draw.

Might it be that v2 was an actual hw, but mmu-500 is somehow
virtualized? And as such by these stalls we might be observing some kind
of FW bug in hyp?

> 
> BR,
> -R

-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-01-07 22:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-02 18:32 [PATCH v2] iommu/arm-smmu-qcom: Only enable stall on smmu-v2 Rob Clark
2025-01-02 19:30 ` Akhil P Oommen
2025-01-06 20:10   ` Akhil P Oommen
2025-01-06 20:59     ` Rob Clark
2025-01-06 21:00     ` Rob Clark
2025-01-07 12:57 ` Will Deacon
2025-01-07 15:26   ` Rob Clark
2025-01-07 22:43     ` Dmitry Baryshkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox