[PATCH 0/2] libata NCQ error handling fix and improvement

public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] libata NCQ error handling fix and improvement
@ 2022-11-11 11:09 Niklas Cassel
  2022-11-11 11:09 ` [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error Niklas Cassel
  2022-11-11 11:09 ` [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors Niklas Cassel
  0 siblings, 2 replies; 6+ messages in thread
From: Niklas Cassel @ 2022-11-11 11:09 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel; +Cc: linux-ide

Hello there,

The first patch fixes an error where non-offending commands during an
NCQ error might not get retried.

The second patch skips the error analysis for non-offending commands
during an NCQ error completely. The command that caused the NCQ command
will still get further analysis, so this single QC will still inherit
link level errors, if e.g. SError signaled a BUS error at the same time
as the IRQ signaled a NCQ error.

Kind regards,
Niklas

Niklas Cassel (2):
  ata: libata: only mark a single command as error during a NCQ error
  ata: libata: skip error analysis for commands that are not errors

 drivers/ata/libata-eh.c   |  1 +
 drivers/ata/libata-sata.c | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

-- 
2.38.1

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error
  2022-11-11 11:09 [PATCH 0/2] libata NCQ error handling fix and improvement Niklas Cassel
@ 2022-11-11 11:09 ` Niklas Cassel
  2022-11-14  2:11   ` Damien Le Moal
  2022-11-11 11:09 ` [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors Niklas Cassel
  1 sibling, 1 reply; 6+ messages in thread
From: Niklas Cassel @ 2022-11-11 11:09 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel; +Cc: linux-ide

A NCQ error means that the device has aborted a single NCQ command and
halted further processing of queued commands.
To get the single NCQ command that caused the NCQ error, host software has
to read the NCQ error log, which also takes the device out of error state.

When the device encounters a NCQ error, we receive an error interrupt from
the HBA, and call ata_do_link_abort() to mark all outstanding commands on
the link as ATA_QCFLAG_FAILED (which means that these commands are owned
by libata EH), and then call ata_qc_complete() on them.

ata_qc_complete() will call fill_result_tf() for all commands marked as
ATA_QCFLAG_FAILED.

The taskfile is simply the latest status/error as seen from the device's
perspective. The taskfile will have ATA_ERR set in the status field and
ATA_ABORTED set in the error field.

When we fill the current taskfile values for all outstanding commands,
that means that qc->result_tf will have ATA_ERR set for all commands
owned by libata EH.

When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
as an error).

When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
data if qc->err_mask is set.

This means that we will generate sense data for commands that really
should not have any sense data set. Having sense data set might cause SCSI
to finish these commands instead of retrying them.

While this incorrect behavior has existed for a long time, this first
became a problem once we started reading the correct taskfile register in
commit 4ba09d202657 ("ata: libahci: read correct status and error field
for NCQ commands").

Before this commit, NCQ commands would read the taskfile values received
from the last non-NCQ command completion, which most likely did not have
ATA_ERR set, since the last non-NCQ command was most likely not an error.

Fix this by clearing ATA_ERR and any error bits for all commands except
the actual command that caused the NCQ error, since the error bits in the
taskfile are not applicable to them.

Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-sata.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 283ce1ab29cf..6b2dcf3eb2fb 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -1476,6 +1476,25 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
 		}
 	}

+	ata_qc_for_each_raw(ap, qc, tag) {
+		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
+		    ata_dev_phys_link(qc->dev) != link)
+			continue;
+
+		/* Skip the single QC which caused the NCQ error. */
+		if (qc->err_mask)
+			continue;
+
+		/*
+		 * For SATA, the STATUS and ERROR fields are shared for all NCQ
+		 * commands that were completed with the same SDB FIS.
+		 * Therefore, we have to clear the ATA_ERR bit for all QCs
+		 * except the one that caused the NCQ error.
+		 */
+		qc->result_tf.status &= ~ATA_ERR;
+		qc->result_tf.error = 0;
+	}
+
 	ehc->i.err_mask &= ~AC_ERR_DEV;
 }
 EXPORT_SYMBOL_GPL(ata_eh_analyze_ncq_error);
-- 
2.38.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error
  2022-11-11 11:09 ` [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error Niklas Cassel
@ 2022-11-14  2:11   ` Damien Le Moal
  2022-11-14 17:19     ` Niklas Cassel
  0 siblings, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2022-11-14  2:11 UTC (permalink / raw)
  To: Niklas Cassel; +Cc: linux-ide

On 11/11/22 20:09, Niklas Cassel wrote:
> A NCQ error means that the device has aborted a single NCQ command and
> halted further processing of queued commands.

Nit: To be strict with wording, this should say something like "an NCQ
command failure results in the device aborting the execution of all active
commands."

> To get the single NCQ command that caused the NCQ error, host software has
> to read the NCQ error log, which also takes the device out of error state.
> 
> When the device encounters a NCQ error, we receive an error interrupt from
> the HBA, and call ata_do_link_abort() to mark all outstanding commands on
> the link as ATA_QCFLAG_FAILED (which means that these commands are owned
> by libata EH), and then call ata_qc_complete() on them.
> 
> ata_qc_complete() will call fill_result_tf() for all commands marked as
> ATA_QCFLAG_FAILED.
> 
> The taskfile is simply the latest status/error as seen from the device's
> perspective. The taskfile will have ATA_ERR set in the status field and
> ATA_ABORTED set in the error field.
> 
> When we fill the current taskfile values for all outstanding commands,
> that means that qc->result_tf will have ATA_ERR set for all commands
> owned by libata EH.
> 
> When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
> it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
> ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
> as an error).
> 
> When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
> by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
> ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
> data if qc->err_mask is set.
> 
> This means that we will generate sense data for commands that really
> should not have any sense data set. Having sense data set might cause SCSI
> to finish these commands instead of retrying them.
> 
> While this incorrect behavior has existed for a long time, this first
> became a problem once we started reading the correct taskfile register in
> commit 4ba09d202657 ("ata: libahci: read correct status and error field
> for NCQ commands").
> 
> Before this commit, NCQ commands would read the taskfile values received
> from the last non-NCQ command completion, which most likely did not have
> ATA_ERR set, since the last non-NCQ command was most likely not an error.
> 
> Fix this by clearing ATA_ERR and any error bits for all commands except
> the actual command that caused the NCQ error, since the error bits in the
> taskfile are not applicable to them.
> 
> Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>  drivers/ata/libata-sata.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
> index 283ce1ab29cf..6b2dcf3eb2fb 100644
> --- a/drivers/ata/libata-sata.c
> +++ b/drivers/ata/libata-sata.c
> @@ -1476,6 +1476,25 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  		}
>  	}
>  
> +	ata_qc_for_each_raw(ap, qc, tag) {
> +		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> +		    ata_dev_phys_link(qc->dev) != link)
> +			continue;
> +
> +		/* Skip the single QC which caused the NCQ error. */
> +		if (qc->err_mask)
> +			continue;

Before the continue, should we check that this qc tag is the one we got
from ata_eh_read_log_10h() ? We should at least warn if there is a mismatch.

> +
> +		/*
> +		 * For SATA, the STATUS and ERROR fields are shared for all NCQ
> +		 * commands that were completed with the same SDB FIS.
> +		 * Therefore, we have to clear the ATA_ERR bit for all QCs
> +		 * except the one that caused the NCQ error.
> +		 */
> +		qc->result_tf.status &= ~ATA_ERR;
> +		qc->result_tf.error = 0;
> +	}
> +
>  	ehc->i.err_mask &= ~AC_ERR_DEV;
>  }
>  EXPORT_SYMBOL_GPL(ata_eh_analyze_ncq_error);

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error
  2022-11-14  2:11   ` Damien Le Moal
@ 2022-11-14 17:19     ` Niklas Cassel
  0 siblings, 0 replies; 6+ messages in thread
From: Niklas Cassel @ 2022-11-14 17:19 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide@vger.kernel.org

On Mon, Nov 14, 2022 at 11:11:26AM +0900, Damien Le Moal wrote:
> On 11/11/22 20:09, Niklas Cassel wrote:
> > A NCQ error means that the device has aborted a single NCQ command and
> > halted further processing of queued commands.
>
> Nit: To be strict with wording, this should say something like "an NCQ
> command failure results in the device aborting the execution of all active
> commands."
>
> > To get the single NCQ command that caused the NCQ error, host software has
> > to read the NCQ error log, which also takes the device out of error state.
> >
> > When the device encounters a NCQ error, we receive an error interrupt from
> > the HBA, and call ata_do_link_abort() to mark all outstanding commands on
> > the link as ATA_QCFLAG_FAILED (which means that these commands are owned
> > by libata EH), and then call ata_qc_complete() on them.
> >
> > ata_qc_complete() will call fill_result_tf() for all commands marked as
> > ATA_QCFLAG_FAILED.
> >
> > The taskfile is simply the latest status/error as seen from the device's
> > perspective. The taskfile will have ATA_ERR set in the status field and
> > ATA_ABORTED set in the error field.
> >
> > When we fill the current taskfile values for all outstanding commands,
> > that means that qc->result_tf will have ATA_ERR set for all commands
> > owned by libata EH.
> >
> > When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
> > it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
> > ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
> > as an error).
> >
> > When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
> > by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
> > ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
> > data if qc->err_mask is set.
> >
> > This means that we will generate sense data for commands that really
> > should not have any sense data set. Having sense data set might cause SCSI
> > to finish these commands instead of retrying them.
> >
> > While this incorrect behavior has existed for a long time, this first
> > became a problem once we started reading the correct taskfile register in
> > commit 4ba09d202657 ("ata: libahci: read correct status and error field
> > for NCQ commands").
> >
> > Before this commit, NCQ commands would read the taskfile values received
> > from the last non-NCQ command completion, which most likely did not have
> > ATA_ERR set, since the last non-NCQ command was most likely not an error.
> >
> > Fix this by clearing ATA_ERR and any error bits for all commands except
> > the actual command that caused the NCQ error, since the error bits in the
> > taskfile are not applicable to them.
> >
> > Fixes: 4ba09d202657 ("ata: libahci: read correct status and error field for NCQ commands")
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > ---
> >  drivers/ata/libata-sata.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
> > index 283ce1ab29cf..6b2dcf3eb2fb 100644
> > --- a/drivers/ata/libata-sata.c
> > +++ b/drivers/ata/libata-sata.c
> > @@ -1476,6 +1476,25 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
> >		}
> >	}
> >
> > +	ata_qc_for_each_raw(ap, qc, tag) {
> > +		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> > +		    ata_dev_phys_link(qc->dev) != link)
> > +			continue;
> > +
> > +		/* Skip the single QC which caused the NCQ error. */
> > +		if (qc->err_mask)
> > +			continue;
>
> Before the continue, should we check that this qc tag is the one we got
> from ata_eh_read_log_10h() ? We should at least warn if there is a mismatch.

I really see no point of displaying a warning in case of mismatch here.

At this point, there will be exactly one command that has AC_ERR_NCQ set.
If ata_eh_read_log_10h() reported an invalid tag, we would have returned
in the check performed directly after ata_eh_read_log_10h() was called:

	if (!(link->sactive & (1 << tag))) {
		ata_link_err(link, "log page 10h reported inactive tag %d\n",
			     tag);
		return;
	}

Which means that we would never have reached this new code.

However, there could theoretically be another command that has e.g.
AC_ERR_TIMEOUT set, if there was a command that timed out just before
the NCQ error, so EH did not yet have a change to run before handling
both errors at the same time.

Therefore I wrote:
+           if (qc->err_mask)
+                   continue;

Instead of:
+           if (qc->err_mask & AC_ERR_NCQ)
+                   continue;


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors
  2022-11-11 11:09 [PATCH 0/2] libata NCQ error handling fix and improvement Niklas Cassel
  2022-11-11 11:09 ` [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error Niklas Cassel
@ 2022-11-11 11:09 ` Niklas Cassel
  2022-11-14  2:15   ` Damien Le Moal
  1 sibling, 1 reply; 6+ messages in thread
From: Niklas Cassel @ 2022-11-11 11:09 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Niklas Cassel, linux-ide

A NCQ error means that the device has aborted a single NCQ command and
halted further processing of queued commands.
To get the single NCQ command that caused the NCQ error, host software has
to read the NCQ error log, which also takes the device out of error state.

ata_eh_link_autopsy() starts off by calling ata_eh_analyze_ncq_error() to
read the NCQ error log, to find the offending command that caused the NCQ
error. ata_eh_analyze_ncq_error() marks the offending command by setting
qc->err_mask to AC_ERR_NCQ.

ata_eh_link_autopsy() then continues with further analysis for all
commands owned by libata EH.

However, once we have found the offending command, we know that all other
commands cannot be an error. (Theoretically a command could have timed out
just before the NCQ error happened (so EH was scheduled but did not yet
run), such command will have AC_ERR_TIMEOUT set in qc->err_mask.)

Therefore, after finding the offending command, we know that we can simply
skip the per command analysis for all commands that have not been marked
as error at this point, since we know that they have to be retried anyway.

Do this by changing ata_eh_analyze_ncq_error() to mark all non-failed
commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
to skip commands marked as ATA_QCFLAG_RETRY.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
Flag ATA_QCFLAG_FAILED is kept since it means that the command is owned
by libata EH. A failed command will always have qc->err_mask set.
Perhaps flag ATA_QCFLAG_FAILED should be renamed to something like
ATA_QCFLAG_OWNED_BY_EH to further clarify this meaning, and that the flag
does not necessarily mean that it is an error.

 drivers/ata/libata-eh.c   | 1 +
 drivers/ata/libata-sata.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index bde15f855f70..34303ce67c14 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1955,6 +1955,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)

 	ata_qc_for_each_raw(ap, qc, tag) {
 		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
+		    qc->flags & ATA_QCFLAG_RETRY ||
 		    ata_dev_phys_link(qc->dev) != link)
 			continue;

diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 6b2dcf3eb2fb..18ef14e749a0 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -1493,6 +1493,14 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
 		 */
 		qc->result_tf.status &= ~ATA_ERR;
 		qc->result_tf.error = 0;
+
+		/*
+		 * If we get a NCQ error, that means that a single command was
+		 * aborted. All other failed commands for our link should be
+		 * retried and has no business of going though further scrutiny
+		 * by ata_eh_link_autopsy().
+		 */
+		qc->flags |= ATA_QCFLAG_RETRY;
 	}

 	ehc->i.err_mask &= ~AC_ERR_DEV;
-- 
2.38.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors
  2022-11-11 11:09 ` [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors Niklas Cassel
@ 2022-11-14  2:15   ` Damien Le Moal
  0 siblings, 0 replies; 6+ messages in thread
From: Damien Le Moal @ 2022-11-14  2:15 UTC (permalink / raw)
  To: Niklas Cassel; +Cc: linux-ide

On 11/11/22 20:09, Niklas Cassel wrote:
> A NCQ error means that the device has aborted a single NCQ command and
> halted further processing of queued commands.

...means that the device has aborted processing of all active commands.

> To get the single NCQ command that caused the NCQ error, host software has
> to read the NCQ error log, which also takes the device out of error state.
> 
> ata_eh_link_autopsy() starts off by calling ata_eh_analyze_ncq_error() to
> read the NCQ error log, to find the offending command that caused the NCQ
> error. ata_eh_analyze_ncq_error() marks the offending command by setting
> qc->err_mask to AC_ERR_NCQ.
> 
> ata_eh_link_autopsy() then continues with further analysis for all
> commands owned by libata EH.
> 
> However, once we have found the offending command, we know that all other
> commands cannot be an error. (Theoretically a command could have timed out
> just before the NCQ error happened (so EH was scheduled but did not yet
> run), such command will have AC_ERR_TIMEOUT set in qc->err_mask.)
> 
> Therefore, after finding the offending command, we know that we can simply
> skip the per command analysis for all commands that have not been marked
> as error at this point, since we know that they have to be retried anyway.
> 
> Do this by changing ata_eh_analyze_ncq_error() to mark all non-failed
> commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
> to skip commands marked as ATA_QCFLAG_RETRY.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
> Flag ATA_QCFLAG_FAILED is kept since it means that the command is owned
> by libata EH. A failed command will always have qc->err_mask set.
> Perhaps flag ATA_QCFLAG_FAILED should be renamed to something like
> ATA_QCFLAG_OWNED_BY_EH to further clarify this meaning, and that the flag
> does not necessarily mean that it is an error.
> 
>  drivers/ata/libata-eh.c   | 1 +
>  drivers/ata/libata-sata.c | 8 ++++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index bde15f855f70..34303ce67c14 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1955,6 +1955,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)
>  
>  	ata_qc_for_each_raw(ap, qc, tag) {
>  		if (!(qc->flags & ATA_QCFLAG_FAILED) ||
> +		    qc->flags & ATA_QCFLAG_RETRY ||
>  		    ata_dev_phys_link(qc->dev) != link)
>  			continue;
>  
> diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
> index 6b2dcf3eb2fb..18ef14e749a0 100644
> --- a/drivers/ata/libata-sata.c
> +++ b/drivers/ata/libata-sata.c
> @@ -1493,6 +1493,14 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
>  		 */
>  		qc->result_tf.status &= ~ATA_ERR;
>  		qc->result_tf.error = 0;
> +
> +		/*
> +		 * If we get a NCQ error, that means that a single command was
> +		 * aborted. All other failed commands for our link should be
> +		 * retried and has no business of going though further scrutiny
> +		 * by ata_eh_link_autopsy().
> +		 */
> +		qc->flags |= ATA_QCFLAG_RETRY;
>  	}
>  
>  	ehc->i.err_mask &= ~AC_ERR_DEV;

I think this patch should be squashed with the previous one. That will
really correctly fix the EH handling of NCQ errors.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-11-14 17:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-11 11:09 [PATCH 0/2] libata NCQ error handling fix and improvement Niklas Cassel
2022-11-11 11:09 ` [PATCH 1/2] ata: libata: only mark a single command as error during a NCQ error Niklas Cassel
2022-11-14  2:11   ` Damien Le Moal
2022-11-14 17:19     ` Niklas Cassel
2022-11-11 11:09 ` [PATCH 2/2] ata: libata: skip error analysis for commands that are not errors Niklas Cassel
2022-11-14  2:15   ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox