public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v10 0/2] fix abort defect
@ 2024-10-01  9:19 peter.wang
  2024-10-01  9:19 ` [PATCH v10 1/2] ufs: core: fix the issue of ICU failure peter.wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: peter.wang @ 2024-10-01  9:19 UTC (permalink / raw)
  To: linux-scsi, martin.petersen, avri.altman, alim.akhtar, jejb
  Cc: wsd_upstream, linux-mediatek, peter.wang, chun-hung.wu,
	alice.chao, cc.chou, chaotian.jing, jiajie.hao, powen.kao,
	qilin.tan, lin.gui, tun-yu.yu, eddie.huang, naomi.chu, ed.tsai,
	bvanassche, quic_nguyenb

From: Peter Wang <peter.wang@mediatek.com>

V10:
 - Requeue OCS: ABORTED request in MCQ mode.

V9:
 - Revise the OCS content printed.
 
V8:
 - Remove the abort variable to simplify the abort process.
 - Correct error handler successfully aborts release flow.
 - Ingore MCQ OCS: ABORTED.

V7:
 - Use a variable instead of a flag.
 - Add a check for MCQ mode when setting this variable to UFS_ERR_HANDLER.
 - Print OCS information for OCS_ABORTED and OCS_INVALID_COMMAND_STATUS.
 - Add a MediaTek quirk for handling OCS_ABORTED in SDB mode.
 - Skip notifying SCSI from ISR during SCSI abort (ufshcd_abort()).

V6:
 - Add err handler check before set flag true.

V5:
 - Change flag name.
 - Amend comment and patch description.

V4:
 - Remove nullify SQ entry abort requeue.
 - Add more comment for flag usage and set description.
 - Fix build warning.

V3:
 - Change comment and use variable(rtc) for error print
 - Change flag name and move flag set before ufshcd_clear_cmd
 - Add SDB mode clear UTRLCLR tag receive OCS_ABORTED requeue

V2:
 - Fix mcq_enabled build error.

Peter Wang (2):
  ufs: core: fix the issue of ICU failure
  ufs: core: requeue aborted request

 drivers/ufs/core/ufs-mcq.c | 15 ++++++++-------
 drivers/ufs/core/ufshcd.c  | 20 ++++----------------
 2 files changed, 12 insertions(+), 23 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v10 1/2] ufs: core: fix the issue of ICU failure
  2024-10-01  9:19 [PATCH v10 0/2] fix abort defect peter.wang
@ 2024-10-01  9:19 ` peter.wang
  2024-10-01  9:19 ` [PATCH v10 2/2] ufs: core: requeue aborted request peter.wang
  2024-10-16  2:38 ` [PATCH v10 0/2] fix abort defect Martin K. Petersen
  2 siblings, 0 replies; 13+ messages in thread
From: peter.wang @ 2024-10-01  9:19 UTC (permalink / raw)
  To: linux-scsi, martin.petersen, avri.altman, alim.akhtar, jejb
  Cc: wsd_upstream, linux-mediatek, peter.wang, chun-hung.wu,
	alice.chao, cc.chou, chaotian.jing, jiajie.hao, powen.kao,
	qilin.tan, lin.gui, tun-yu.yu, eddie.huang, naomi.chu, ed.tsai,
	bvanassche, quic_nguyenb, stable

From: Peter Wang <peter.wang@mediatek.com>

When setting the ICU bit without using read-modify-write,
SQRTCy will restart SQ again and receive an RTC return
error code 2 (Failure - SQ not stopped).

Additionally, the error log has been modified so that
this type of error can be observed.

Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
Cc: stable@vger.kernel.org
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Reviewed-by: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/ufs/core/ufs-mcq.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/ufs/core/ufs-mcq.c b/drivers/ufs/core/ufs-mcq.c
index 5891cdacd0b3..3903947dbed1 100644
--- a/drivers/ufs/core/ufs-mcq.c
+++ b/drivers/ufs/core/ufs-mcq.c
@@ -539,7 +539,7 @@ int ufshcd_mcq_sq_cleanup(struct ufs_hba *hba, int task_tag)
 	struct scsi_cmnd *cmd = lrbp->cmd;
 	struct ufs_hw_queue *hwq;
 	void __iomem *reg, *opr_sqd_base;
-	u32 nexus, id, val;
+	u32 nexus, id, val, rtc;
 	int err;
 
 	if (hba->quirks & UFSHCD_QUIRK_MCQ_BROKEN_RTC)
@@ -569,17 +569,18 @@ int ufshcd_mcq_sq_cleanup(struct ufs_hba *hba, int task_tag)
 	opr_sqd_base = mcq_opr_base(hba, OPR_SQD, id);
 	writel(nexus, opr_sqd_base + REG_SQCTI);
 
-	/* SQRTCy.ICU = 1 */
-	writel(SQ_ICU, opr_sqd_base + REG_SQRTC);
+	/* Initiate Cleanup */
+	writel(readl(opr_sqd_base + REG_SQRTC) | SQ_ICU,
+		opr_sqd_base + REG_SQRTC);
 
 	/* Poll SQRTSy.CUS = 1. Return result from SQRTSy.RTC */
 	reg = opr_sqd_base + REG_SQRTS;
 	err = read_poll_timeout(readl, val, val & SQ_CUS, 20,
 				MCQ_POLL_US, false, reg);
-	if (err)
-		dev_err(hba->dev, "%s: failed. hwq=%d, tag=%d err=%ld\n",
-			__func__, id, task_tag,
-			FIELD_GET(SQ_ICU_ERR_CODE_MASK, readl(reg)));
+	rtc = FIELD_GET(SQ_ICU_ERR_CODE_MASK, readl(reg));
+	if (err || rtc)
+		dev_err(hba->dev, "%s: failed. hwq=%d, tag=%d err=%d RTC=%d\n",
+			__func__, id, task_tag, err, rtc);
 
 	if (ufshcd_mcq_sq_start(hba, hwq))
 		err = -ETIMEDOUT;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-01  9:19 [PATCH v10 0/2] fix abort defect peter.wang
  2024-10-01  9:19 ` [PATCH v10 1/2] ufs: core: fix the issue of ICU failure peter.wang
@ 2024-10-01  9:19 ` peter.wang
  2024-10-01 17:13   ` Bart Van Assche
  2024-10-09 17:59   ` Bart Van Assche
  2024-10-16  2:38 ` [PATCH v10 0/2] fix abort defect Martin K. Petersen
  2 siblings, 2 replies; 13+ messages in thread
From: peter.wang @ 2024-10-01  9:19 UTC (permalink / raw)
  To: linux-scsi, martin.petersen, avri.altman, alim.akhtar, jejb
  Cc: wsd_upstream, linux-mediatek, peter.wang, chun-hung.wu,
	alice.chao, cc.chou, chaotian.jing, jiajie.hao, powen.kao,
	qilin.tan, lin.gui, tun-yu.yu, eddie.huang, naomi.chu, ed.tsai,
	bvanassche, quic_nguyenb, stable

From: Peter Wang <peter.wang@mediatek.com>

After the SQ cleanup fix, the CQ will receive a response with
the corresponding tag marked as OCS: ABORTED. To align with
the behavior of Legacy SDB mode, the handling of OCS: ABORTED
has been changed to match that of OCS_INVALID_COMMAND_STATUS
(SDB), with both returning a SCSI result of DID_REQUEUE.

Furthermore, the workaround implemented before the SQ cleanup
fix can be removed.

Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
Cc: stable@vger.kernel.org
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 24a32e2fd75e..8e2a7889a565 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp,
 		}
 		break;
 	case OCS_ABORTED:
-		result |= DID_ABORT << 16;
-		break;
 	case OCS_INVALID_COMMAND_STATUS:
 		result |= DID_REQUEUE << 16;
+		dev_warn(hba->dev,
+				"OCS %s from controller for tag %d\n",
+				(ocs == OCS_ABORTED? "aborted" : "invalid"),
+				lrbp->task_tag);
 		break;
 	case OCS_INVALID_CMD_TABLE_ATTR:
 	case OCS_INVALID_PRDT_ATTR:
@@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv)
 	struct scsi_device *sdev = cmd->device;
 	struct Scsi_Host *shost = sdev->host;
 	struct ufs_hba *hba = shost_priv(shost);
-	struct ufshcd_lrb *lrbp = &hba->lrb[tag];
-	struct ufs_hw_queue *hwq;
-	unsigned long flags;
 
 	*ret = ufshcd_try_to_abort_task(hba, tag);
 	dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
 		hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
 		*ret ? "failed" : "succeeded");
 
-	/* Release cmd in MCQ mode if abort succeeds */
-	if (hba->mcq_enabled && (*ret == 0)) {
-		hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd));
-		if (!hwq)
-			return 0;
-		spin_lock_irqsave(&hwq->cq_lock, flags);
-		if (ufshcd_cmd_inflight(lrbp->cmd))
-			ufshcd_release_scsi_cmd(hba, lrbp);
-		spin_unlock_irqrestore(&hwq->cq_lock, flags);
-	}
-
 	return *ret == 0;
 }
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-01  9:19 ` [PATCH v10 2/2] ufs: core: requeue aborted request peter.wang
@ 2024-10-01 17:13   ` Bart Van Assche
  2024-10-02 12:42     ` Peter Wang (王信友)
  2024-10-09 17:59   ` Bart Van Assche
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-10-01 17:13 UTC (permalink / raw)
  To: peter.wang, linux-scsi, martin.petersen; +Cc: wsd_upstream, linux-mediatek

On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> From: Peter Wang <peter.wang@mediatek.com>
> 
> After the SQ cleanup fix, the CQ will receive a response with
> the corresponding tag marked as OCS: ABORTED. To align with
> the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> has been changed to match that of OCS_INVALID_COMMAND_STATUS
> (SDB), with both returning a SCSI result of DID_REQUEUE.
> 
> Furthermore, the workaround implemented before the SQ cleanup
> fix can be removed.
> 
> Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
> Cc: stable@vger.kernel.org
> Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> ---
>   drivers/ufs/core/ufshcd.c | 20 ++++----------------
>   1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 24a32e2fd75e..8e2a7889a565 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp,
>   		}
>   		break;
>   	case OCS_ABORTED:
> -		result |= DID_ABORT << 16;
> -		break;
>   	case OCS_INVALID_COMMAND_STATUS:
>   		result |= DID_REQUEUE << 16;
> +		dev_warn(hba->dev,
> +				"OCS %s from controller for tag %d\n",
> +				(ocs == OCS_ABORTED? "aborted" : "invalid"),
> +				lrbp->task_tag);
>   		break;
>   	case OCS_INVALID_CMD_TABLE_ATTR:
>   	case OCS_INVALID_PRDT_ATTR:
> @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv)
>   	struct scsi_device *sdev = cmd->device;
>   	struct Scsi_Host *shost = sdev->host;
>   	struct ufs_hba *hba = shost_priv(shost);
> -	struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> -	struct ufs_hw_queue *hwq;
> -	unsigned long flags;
>   
>   	*ret = ufshcd_try_to_abort_task(hba, tag);
>   	dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
>   		hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
>   		*ret ? "failed" : "succeeded");
>   
> -	/* Release cmd in MCQ mode if abort succeeds */
> -	if (hba->mcq_enabled && (*ret == 0)) {
> -		hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd));
> -		if (!hwq)
> -			return 0;
> -		spin_lock_irqsave(&hwq->cq_lock, flags);
> -		if (ufshcd_cmd_inflight(lrbp->cmd))
> -			ufshcd_release_scsi_cmd(hba, lrbp);
> -		spin_unlock_irqrestore(&hwq->cq_lock, flags);
> -	}
> -
>   	return *ret == 0;
>   }

As mentioned before, ufshcd_try_to_abort_task() cannot handle concurrent
scsi_done() calls. ufshcd_abort_one() calls ufshcd_try_to_abort_task()
without even trying to prevent that scsi_done() is called concurrently. 
Since this could result in a kernel crash, I think that it is important 
that this gets fixed, even if it requires modifying the SCSI core.

Bart.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-01 17:13   ` Bart Van Assche
@ 2024-10-02 12:42     ` Peter Wang (王信友)
  2024-10-03 20:02       ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Wang (王信友) @ 2024-10-02 12:42 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, bvanassche@acm.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On Tue, 2024-10-01 at 10:13 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> > From: Peter Wang <peter.wang@mediatek.com>
> > 
> > After the SQ cleanup fix, the CQ will receive a response with
> > the corresponding tag marked as OCS: ABORTED. To align with
> > the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> > has been changed to match that of OCS_INVALID_COMMAND_STATUS
> > (SDB), with both returning a SCSI result of DID_REQUEUE.
> > 
> > Furthermore, the workaround implemented before the SQ cleanup
> > fix can be removed.
> > 
> > Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ
> mode")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> > ---
> >   drivers/ufs/core/ufshcd.c | 20 ++++----------------
> >   1 file changed, 4 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index 24a32e2fd75e..8e2a7889a565 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba
> *hba, struct ufshcd_lrb *lrbp,
> >   }
> >   break;
> >   case OCS_ABORTED:
> > -result |= DID_ABORT << 16;
> > -break;
> >   case OCS_INVALID_COMMAND_STATUS:
> >   result |= DID_REQUEUE << 16;
> > +dev_warn(hba->dev,
> > +"OCS %s from controller for tag %d\n",
> > +(ocs == OCS_ABORTED? "aborted" : "invalid"),
> > +lrbp->task_tag);
> >   break;
> >   case OCS_INVALID_CMD_TABLE_ATTR:
> >   case OCS_INVALID_PRDT_ATTR:
> > @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request
> *rq, void *priv)
> >   struct scsi_device *sdev = cmd->device;
> >   struct Scsi_Host *shost = sdev->host;
> >   struct ufs_hba *hba = shost_priv(shost);
> > -struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> > -struct ufs_hw_queue *hwq;
> > -unsigned long flags;
> >   
> >   *ret = ufshcd_try_to_abort_task(hba, tag);
> >   dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
> >   hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
> >   *ret ? "failed" : "succeeded");
> >   
> > -/* Release cmd in MCQ mode if abort succeeds */
> > -if (hba->mcq_enabled && (*ret == 0)) {
> > -hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd));
> > -if (!hwq)
> > -return 0;
> > -spin_lock_irqsave(&hwq->cq_lock, flags);
> > -if (ufshcd_cmd_inflight(lrbp->cmd))
> > -ufshcd_release_scsi_cmd(hba, lrbp);
> > -spin_unlock_irqrestore(&hwq->cq_lock, flags);
> > -}
> > -
> >   return *ret == 0;
> >   }
> 
> As mentioned before, ufshcd_try_to_abort_task() cannot handle
> concurrent
> scsi_done() calls. ufshcd_abort_one() calls
> ufshcd_try_to_abort_task()
> without even trying to prevent that scsi_done() is called
> concurrently. 
> Since this could result in a kernel crash, I think that it is
> important 
> that this gets fixed, even if it requires modifying the SCSI core.
> 
> Bart.
> 
> 

Hi Bart,

This patch merely aligns with the approach of SDB mode 
and does not involve the flow of scsi_done. Besides, 
I don't see any issue with concurrency between 
ufshcd_abort_one() calling ufshcd_try_to_abort_task() 
and scsi_done(). Can you point out the specific flow where 
the problem occurs? If there is one, shouldn't SDB mode
have the same issue?

Thanks
Peter





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-02 12:42     ` Peter Wang (王信友)
@ 2024-10-03 20:02       ` Bart Van Assche
  2024-10-07  7:20         ` Peter Wang (王信友)
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-10-03 20:02 UTC (permalink / raw)
  To: Peter Wang (王信友), linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> This patch merely aligns with the approach of SDB mode
> and does not involve the flow of scsi_done. Besides,
> I don't see any issue with concurrency between
> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> and scsi_done(). Can you point out the specific flow where
> the problem occurs? If there is one, shouldn't SDB mode
> have the same issue?

Hi Peter,

Correct, my comment applies to both legacy mode and MCQ mode. From the 
section in the UFS standard about ABORT TASK: "A response of FUNCTION
COMPLETE shall indicate that the command was aborted or was not in the 
task set." In other words, if a command completes just before 
ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
that has already completed. In legacy mode, this call will succeed.
Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to be
decremented twice instead of only once. Do you agree that this can
happen and also that it should be prevented that this happens?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-03 20:02       ` Bart Van Assche
@ 2024-10-07  7:20         ` Peter Wang (王信友)
  2024-10-08 18:29           ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Wang (王信友) @ 2024-10-07  7:20 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, bvanassche@acm.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> > This patch merely aligns with the approach of SDB mode
> > and does not involve the flow of scsi_done. Besides,
> > I don't see any issue with concurrency between
> > ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> > and scsi_done(). Can you point out the specific flow where
> > the problem occurs? If there is one, shouldn't SDB mode
> > have the same issue?
> 
> Hi Peter,
> 
> Correct, my comment applies to both legacy mode and MCQ mode. From
> the 
> section in the UFS standard about ABORT TASK: "A response of FUNCTION
> COMPLETE shall indicate that the command was aborted or was not in
> the 
> task set." In other words, if a command completes just before 
> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
> that has already completed. In legacy mode, this call will succeed.
> 

Hi Bart,

Yes, the legacy SDB mode is protected by the outstanding_lock.


> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
> be
> decremented twice instead of only once. Do you agree that this can
> happen and also that it should be prevented that this happens?
> 
> Thanks,
> 
> Bart.

Sorry, I still don't understand why both ufshcd_compl_one_cqe() 
and ufshcd_abort_all() will call ufshcd_release(hba)? 
Because I have already removed the ufshcd_release_scsi_cmd from 
ufshcd_abort_one, so the command won't be released immediately 
after ufshcd_try_to_abort_task succeeds. Instead, it will wait 
until the CQ Entry comes in before releasing. And since it is 
protected by the cq_lock, it should only release once, right?

Thanks.
Peter


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-07  7:20         ` Peter Wang (王信友)
@ 2024-10-08 18:29           ` Bart Van Assche
  2024-10-09  2:17             ` Peter Wang (王信友)
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-10-08 18:29 UTC (permalink / raw)
  To: Peter Wang (王信友), linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On 10/7/24 12:20 AM, Peter Wang (王信友) wrote:
> On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>>   	
>> External email : Please do not click links or open attachments until
>> you have verified the sender or the content.
>>   On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
>>> This patch merely aligns with the approach of SDB mode
>>> and does not involve the flow of scsi_done. Besides,
>>> I don't see any issue with concurrency between
>>> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
>>> and scsi_done(). Can you point out the specific flow where
>>> the problem occurs? If there is one, shouldn't SDB mode
>>> have the same issue?
>>
>> Hi Peter,
>>
>> Correct, my comment applies to both legacy mode and MCQ mode. From
>> the
>> section in the UFS standard about ABORT TASK: "A response of FUNCTION
>> COMPLETE shall indicate that the command was aborted or was not in
>> the
>> task set." In other words, if a command completes just before
>> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
>> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
>> that has already completed. In legacy mode, this call will succeed.
>>
> 
> Hi Bart,
> 
> Yes, the legacy SDB mode is protected by the outstanding_lock.
> 
> 
>> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
>> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
>> be
>> decremented twice instead of only once. Do you agree that this can
>> happen and also that it should be prevented that this happens?
>>
>> Thanks,
>>
>> Bart.
> 
> Sorry, I still don't understand why both ufshcd_compl_one_cqe()
> and ufshcd_abort_all() will call ufshcd_release(hba)?
> Because I have already removed the ufshcd_release_scsi_cmd from
> ufshcd_abort_one, so the command won't be released immediately
> after ufshcd_try_to_abort_task succeeds. Instead, it will wait
> until the CQ Entry comes in before releasing. And since it is
> protected by the cq_lock, it should only release once, right?

Hi Peter,

I think what you wrote applies to MCQ mode only. In my previous email
I clearly referred to "legacy mode" (SDB mode). Summarizing my previous
email, I think that in legacy mode it is possible that ufshcd_release()
is called twice while it only should be called once. Here are the
possible solutions I see:
* Add a function to the SCSI core for setting SCMD_STATE_COMPLETE. This
   may be controversial since no other SCSI LLD needs this functionality.
* Changing the error handling approach in the UFS driver to the same
   approach other SCSI LLDs use: instead of using queue_work() to
   activate the error handler, call scsi_schedule_eh(). This will cause
   the error handler to be activated later, namely after all pending
   commands have timed out instead of aborting any pending commands
   first.
* Add a variant of scsi_schedule_eh() to the SCSI core that accelerates
   error handling by calling scsi_timeout() on all pending commands.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-08 18:29           ` Bart Van Assche
@ 2024-10-09  2:17             ` Peter Wang (王信友)
  2024-10-09 18:06               ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Wang (王信友) @ 2024-10-09  2:17 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, bvanassche@acm.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On Tue, 2024-10-08 at 11:29 -0700, Bart Van Assche wrote:
> Hi Peter,
> 
> I think what you wrote applies to MCQ mode only. In my previous email
> I clearly referred to "legacy mode" (SDB mode). Summarizing my
> previous
> email, I think that in legacy mode it is possible that
> ufshcd_release()
> is called twice while it only should be called once. Here are the
> possible solutions I see:
> * Add a function to the SCSI core for setting SCMD_STATE_COMPLETE.
> This
>    may be controversial since no other SCSI LLD needs this
> functionality.
> * Changing the error handling approach in the UFS driver to the same
>    approach other SCSI LLDs use: instead of using queue_work() to
>    activate the error handler, call scsi_schedule_eh(). This will
> cause
>    the error handler to be activated later, namely after all pending
>    commands have timed out instead of aborting any pending commands
>    first.
> * Add a variant of scsi_schedule_eh() to the SCSI core that
> accelerates
>    error handling by calling scsi_timeout() on all pending commands.
> 
> Thanks,
> 
> Bart.
> 

Hi Bart,

Yes, this patch is only for MCQ mode, because only MCQ mode 
receives OCS: ABORTED, right? This patch doesn't modify 
any of the Legacy mode flows, does it?

Additionally, I still don't understand why you say there would 
be an issue with legacy mode having duplicate ufshcd_release(hba) 
calls. As I mentioned before, it is protected by the 
outstanding_lock. Could you please clarify the detailed 
error flow?

Furthermore, even if there is an issue with Legacy mode, it 
should be addressed by a separate patch, not by this one, which is 
intended to resolve the MCQ mode issue. We shouldn't mix two 
different issues together, don't you agree?

Thanks
Peter



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-01  9:19 ` [PATCH v10 2/2] ufs: core: requeue aborted request peter.wang
  2024-10-01 17:13   ` Bart Van Assche
@ 2024-10-09 17:59   ` Bart Van Assche
  1 sibling, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2024-10-09 17:59 UTC (permalink / raw)
  To: peter.wang, linux-scsi, martin.petersen, avri.altman, alim.akhtar,
	jejb
  Cc: wsd_upstream, linux-mediatek, chun-hung.wu, alice.chao, cc.chou,
	chaotian.jing, jiajie.hao, powen.kao, qilin.tan, lin.gui,
	tun-yu.yu, eddie.huang, naomi.chu, ed.tsai, quic_nguyenb, stable

On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> After the SQ cleanup fix, the CQ will receive a response with
> the corresponding tag marked as OCS: ABORTED. To align with
> the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> has been changed to match that of OCS_INVALID_COMMAND_STATUS
> (SDB), with both returning a SCSI result of DID_REQUEUE.
> 
> Furthermore, the workaround implemented before the SQ cleanup
> fix can be removed.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-09  2:17             ` Peter Wang (王信友)
@ 2024-10-09 18:06               ` Bart Van Assche
  2024-10-11  5:44                 ` Peter Wang (王信友)
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-10-09 18:06 UTC (permalink / raw)
  To: Peter Wang (王信友), linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org


On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> Yes, this patch is only for MCQ mode, because only MCQ mode
> receives OCS: ABORTED, right? This patch doesn't modify
> any of the Legacy mode flows, does it?

Agreed. What I mentioned in my email is an existing bug in the legacy
flow for ufshcd_abort_all().

> Furthermore, even if there is an issue with Legacy mode, it
> should be addressed by a separate patch, not by this one, which is
> intended to resolve the MCQ mode issue. We shouldn't mix two
> different issues together, don't you agree?

Let's proceed with this patch series and let's address what I brought
up in my email separately.

With the current approach for error handling in the UFS driver, anyone
who wants to verify or modify ufshcd_try_to_abort_task() has to consider
all possible interleavings of ufshcd_try_to_abort_task() and the
completion path (ufshcd_compl_one_cqe()). That's an unnecessary burden
on UFS driver contributors. Additionally, this is error-prone. This
applies to both modes (legacy and MCQ). I know of reports of sporadic
crashes in legacy mode related to UFS error handling. I'm wondering
whether these are perhaps the result of the issue I mentioned in a
previous email. Anyway, I will look further into this myself as soon as
I have the time.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 2/2] ufs: core: requeue aborted request
  2024-10-09 18:06               ` Bart Van Assche
@ 2024-10-11  5:44                 ` Peter Wang (王信友)
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Wang (王信友) @ 2024-10-11  5:44 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org, bvanassche@acm.org,
	martin.petersen@oracle.com
  Cc: wsd_upstream, linux-mediatek@lists.infradead.org

On Wed, 2024-10-09 at 11:06 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  
> On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> > Yes, this patch is only for MCQ mode, because only MCQ mode
> > receives OCS: ABORTED, right? This patch doesn't modify
> > any of the Legacy mode flows, does it?
> 
> Agreed. What I mentioned in my email is an existing bug in the legacy
> flow for ufshcd_abort_all().
> 
> > Furthermore, even if there is an issue with Legacy mode, it
> > should be addressed by a separate patch, not by this one, which is
> > intended to resolve the MCQ mode issue. We shouldn't mix two
> > different issues together, don't you agree?
> 
> Let's proceed with this patch series and let's address what I brought
> up in my email separately.
> 
> With the current approach for error handling in the UFS driver,
> anyone
> who wants to verify or modify ufshcd_try_to_abort_task() has to
> consider
> all possible interleavings of ufshcd_try_to_abort_task() and the
> completion path (ufshcd_compl_one_cqe()). That's an unnecessary
> burden
> on UFS driver contributors. Additionally, this is error-prone. This
> applies to both modes (legacy and MCQ). I know of reports of sporadic
> crashes in legacy mode related to UFS error handling. I'm wondering
> whether these are perhaps the result of the issue I mentioned in a
> previous email. Anyway, I will look further into this myself as soon
> as
> I have the time.
> 
> Thanks,
> 
> Bart.

Hi Bart,

Thank you for your review.
 
I currently cannot see the issue of duplicate releases in 
legacy SDB mode. ufshcd_try_to_abort_task() will directly 
reset if it fails. It is only in the case of success that 
we need to consider the possibility of ufshcd_compl_one_cqe. 
I believe the original design flow has already taken this 
into account, which is why there is protection with 
outstanding_lock/cq_lock. Perhaps we can wait for an actual 
example to occur before making corrections. Even if there 
is an issue, I think the probability should be very low, 
because the flow for legacy SDB mode has been in use 
for several years.

Thank you again for your review.

Thanks
Peter





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v10 0/2] fix abort defect
  2024-10-01  9:19 [PATCH v10 0/2] fix abort defect peter.wang
  2024-10-01  9:19 ` [PATCH v10 1/2] ufs: core: fix the issue of ICU failure peter.wang
  2024-10-01  9:19 ` [PATCH v10 2/2] ufs: core: requeue aborted request peter.wang
@ 2024-10-16  2:38 ` Martin K. Petersen
  2 siblings, 0 replies; 13+ messages in thread
From: Martin K. Petersen @ 2024-10-16  2:38 UTC (permalink / raw)
  To: linux-scsi, avri.altman, alim.akhtar, jejb, peter.wang
  Cc: Martin K . Petersen, wsd_upstream, linux-mediatek, chun-hung.wu,
	alice.chao, cc.chou, chaotian.jing, jiajie.hao, powen.kao,
	qilin.tan, lin.gui, tun-yu.yu, eddie.huang, naomi.chu, ed.tsai,
	bvanassche, quic_nguyenb

On Tue, 01 Oct 2024 17:19:15 +0800, peter.wang@mediatek.com wrote:

> V10:
>  - Requeue OCS: ABORTED request in MCQ mode.
> 
> V9:
>  - Revise the OCS content printed.
> 
> V8:
>  - Remove the abort variable to simplify the abort process.
>  - Correct error handler successfully aborts release flow.
>  - Ingore MCQ OCS: ABORTED.
> 
> [...]

Applied to 6.12/scsi-fixes, thanks!

[1/2] ufs: core: fix the issue of ICU failure
      https://git.kernel.org/mkp/scsi/c/bf0c6cc73f7f
[2/2] ufs: core: requeue aborted request
      https://git.kernel.org/mkp/scsi/c/8fa075804cb3

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-10-16  2:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-01  9:19 [PATCH v10 0/2] fix abort defect peter.wang
2024-10-01  9:19 ` [PATCH v10 1/2] ufs: core: fix the issue of ICU failure peter.wang
2024-10-01  9:19 ` [PATCH v10 2/2] ufs: core: requeue aborted request peter.wang
2024-10-01 17:13   ` Bart Van Assche
2024-10-02 12:42     ` Peter Wang (王信友)
2024-10-03 20:02       ` Bart Van Assche
2024-10-07  7:20         ` Peter Wang (王信友)
2024-10-08 18:29           ` Bart Van Assche
2024-10-09  2:17             ` Peter Wang (王信友)
2024-10-09 18:06               ` Bart Van Assche
2024-10-11  5:44                 ` Peter Wang (王信友)
2024-10-09 17:59   ` Bart Van Assche
2024-10-16  2:38 ` [PATCH v10 0/2] fix abort defect Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox