[PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending

public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
@ 2022-11-08 23:15 Niklas Cassel
  2022-11-09  5:15 ` Damien Le Moal
  2022-11-11  8:55 ` Damien Le Moal
  0 siblings, 2 replies; 6+ messages in thread
From: Niklas Cassel @ 2022-11-08 23:15 UTC (permalink / raw)
  To: damien.lemoal, tj, hare; +Cc: linux-ide, Niklas Cassel

While the ATA specification states that a device should return command
aborted for all commands queued after the device has entered error state,
since ATA only keeps the sense data for the latest command (in non-NCQ
case), we really don't want to send block layer commands to the device
after it has entered error state. (Only ATA EH commands should be sent,
to read the sense data etc.)

Currently, scsi_queue_rq() will check if scsi_host_in_recovery()
(state is SHOST_RECOVERY), and if so, it will _not_ issue a command via:
scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd())
-> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue()

Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
when receiving a TFES error IRQ, the call chain looked like this:
ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() ->
ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() ->
blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) ->
scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY)

Which meant that as soon as an error IRQ was serviced, SHOST_RECOVERY
would be set.

However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
scsi_times_out() will instead call scsi_abort_command() which will queue
delayed work, and the worker function scmd_eh_abort_handler() will call
scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY).

So now, after the TFES error IRQ has been serviced, we need to wait for
the SCSI workqueue to run its work before SHOST_RECOVERY gets set.

It is worth noting that, even before commit e494f6a72839 ("[SCSI] improved
eh timeout handler"), we could receive an error IRQ from the time when
scsi_queue_rq() checks scsi_host_in_recovery(), to the time when
ata_scsi_queuecmd() is actually called.

In order to handle both the delayed setting of SHOST_RECOVERY and the
window where we can receive an error IRQ, add a check against
ATA_PFLAG_EH_PENDING (which gets set when servicing the error IRQ),
inside ata_scsi_queuecmd() itself, while holding the ap->lock.
(Since the ap->lock is held while servicing IRQs.)

Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
Changes since v2:
-Improve commit message and the comment inside the code.
-Move the check to __ata_scsi_queuecmd().

 drivers/ata/libata-scsi.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 4cb914103382..93ebcdf2e354 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3978,9 +3978,19 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)

 int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev)
 {
+	struct ata_port *ap = dev->link->ap;
 	u8 scsi_op = scmd->cmnd[0];
 	ata_xlat_func_t xlat_func;

+	/*
+	 * scsi_queue_rq() will defer commands if scsi_host_in_recovery().
+	 * However, this check is done without holding the ap->lock (a libata
+	 * specific lock), so we can have received an error irq since then,
+	 * therefore we must check if EH is pending, while holding ap->lock.
+	 */
+	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS))
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+
 	if (unlikely(!scmd->cmd_len))
 		goto bad_cdb_len;

-- 
2.38.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
  2022-11-08 23:15 [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending Niklas Cassel
@ 2022-11-09  5:15 ` Damien Le Moal
  2022-11-09  7:57   ` John Garry
  2022-11-11  8:55 ` Damien Le Moal
  1 sibling, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2022-11-09  5:15 UTC (permalink / raw)
  To: Niklas Cassel, tj, hare, John Garry; +Cc: linux-ide

On 11/9/22 08:15, Niklas Cassel wrote:
> While the ATA specification states that a device should return command
> aborted for all commands queued after the device has entered error state,
> since ATA only keeps the sense data for the latest command (in non-NCQ
> case), we really don't want to send block layer commands to the device
> after it has entered error state. (Only ATA EH commands should be sent,
> to read the sense data etc.)
> 
> Currently, scsi_queue_rq() will check if scsi_host_in_recovery()
> (state is SHOST_RECOVERY), and if so, it will _not_ issue a command via:
> scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd())
> -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue()
> 
> Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> when receiving a TFES error IRQ, the call chain looked like this:
> ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() ->
> ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() ->
> blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) ->
> scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY)
> 
> Which meant that as soon as an error IRQ was serviced, SHOST_RECOVERY
> would be set.
> 
> However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> scsi_times_out() will instead call scsi_abort_command() which will queue
> delayed work, and the worker function scmd_eh_abort_handler() will call
> scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY).
> 
> So now, after the TFES error IRQ has been serviced, we need to wait for
> the SCSI workqueue to run its work before SHOST_RECOVERY gets set.
> 
> It is worth noting that, even before commit e494f6a72839 ("[SCSI] improved
> eh timeout handler"), we could receive an error IRQ from the time when
> scsi_queue_rq() checks scsi_host_in_recovery(), to the time when
> ata_scsi_queuecmd() is actually called.
> 
> In order to handle both the delayed setting of SHOST_RECOVERY and the
> window where we can receive an error IRQ, add a check against
> ATA_PFLAG_EH_PENDING (which gets set when servicing the error IRQ),
> inside ata_scsi_queuecmd() itself, while holding the ap->lock.
> (Since the ap->lock is held while servicing IRQs.)
> 
> Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
> Changes since v2:
> -Improve commit message and the comment inside the code.
> -Move the check to __ata_scsi_queuecmd().
> 
>  drivers/ata/libata-scsi.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 4cb914103382..93ebcdf2e354 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3978,9 +3978,19 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
>  
>  int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev)
>  {
> +	struct ata_port *ap = dev->link->ap;
>  	u8 scsi_op = scmd->cmnd[0];
>  	ata_xlat_func_t xlat_func;
>  
> +	/*
> +	 * scsi_queue_rq() will defer commands if scsi_host_in_recovery().
> +	 * However, this check is done without holding the ap->lock (a libata
> +	 * specific lock), so we can have received an error irq since then,
> +	 * therefore we must check if EH is pending, while holding ap->lock.
> +	 */
> +	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS))
> +		return SCSI_MLQUEUE_DEVICE_BUSY;
> +

I prefer this to the scsi recovery state approach.

John,

Can you test this with libsas ?


>  	if (unlikely(!scmd->cmd_len))
>  		goto bad_cdb_len;
>  

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
  2022-11-09  5:15 ` Damien Le Moal
@ 2022-11-09  7:57   ` John Garry
  0 siblings, 0 replies; 6+ messages in thread
From: John Garry @ 2022-11-09  7:57 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, tj, hare; +Cc: linux-ide, chenxiang66

On 09/11/2022 05:15, Damien Le Moal wrote:

+

> On 11/9/22 08:15, Niklas Cassel wrote:
>> While the ATA specification states that a device should return command
>> aborted for all commands queued after the device has entered error state,
>> since ATA only keeps the sense data for the latest command (in non-NCQ
>> case), we really don't want to send block layer commands to the device
>> after it has entered error state. (Only ATA EH commands should be sent,
>> to read the sense data etc.)
>>
>> Currently, scsi_queue_rq() will check if scsi_host_in_recovery()
>> (state is SHOST_RECOVERY), and if so, it will _not_ issue a command via:
>> scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd())
>> -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue()
>>
>> Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
>> when receiving a TFES error IRQ, the call chain looked like this:
>> ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() ->
>> ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() ->
>> blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) ->
>> scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY)
>>
>> Which meant that as soon as an error IRQ was serviced, SHOST_RECOVERY
>> would be set.
>>
>> However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
>> scsi_times_out() will instead call scsi_abort_command() which will queue
>> delayed work, and the worker function scmd_eh_abort_handler() will call
>> scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY).
>>
>> So now, after the TFES error IRQ has been serviced, we need to wait for
>> the SCSI workqueue to run its work before SHOST_RECOVERY gets set.
>>
>> It is worth noting that, even before commit e494f6a72839 ("[SCSI] improved
>> eh timeout handler"), we could receive an error IRQ from the time when
>> scsi_queue_rq() checks scsi_host_in_recovery(), to the time when
>> ata_scsi_queuecmd() is actually called.
>>
>> In order to handle both the delayed setting of SHOST_RECOVERY and the
>> window where we can receive an error IRQ, add a check against
>> ATA_PFLAG_EH_PENDING (which gets set when servicing the error IRQ),
>> inside ata_scsi_queuecmd() itself, while holding the ap->lock.
>> (Since the ap->lock is held while servicing IRQs.)
>>
>> Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>> ---
>> Changes since v2:
>> -Improve commit message and the comment inside the code.
>> -Move the check to __ata_scsi_queuecmd().
>>
>>   drivers/ata/libata-scsi.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index 4cb914103382..93ebcdf2e354 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -3978,9 +3978,19 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
>>   
>>   int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev)
>>   {
>> +	struct ata_port *ap = dev->link->ap;
>>   	u8 scsi_op = scmd->cmnd[0];
>>   	ata_xlat_func_t xlat_func;
>>   
>> +	/*
>> +	 * scsi_queue_rq() will defer commands if scsi_host_in_recovery().
>> +	 * However, this check is done without holding the ap->lock (a libata
>> +	 * specific lock), so we can have received an error irq since then,
>> +	 * therefore we must check if EH is pending, while holding ap->lock.
>> +	 */
>> +	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS))
>> +		return SCSI_MLQUEUE_DEVICE_BUSY;
>> +
> 
> I prefer this to the scsi recovery state approach.
> 
> John,
> 
> Can you test this with libsas ?

I have no HW access currently, sorry. Maybe Xiang Chen, cc'ed, can assist.

Thanks,
John

> 
> 
>>   	if (unlikely(!scmd->cmd_len))
>>   		goto bad_cdb_len;
>>   
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
  2022-11-08 23:15 [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending Niklas Cassel
  2022-11-09  5:15 ` Damien Le Moal
@ 2022-11-11  8:55 ` Damien Le Moal
  2022-11-11 10:11   ` John Garry
  1 sibling, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2022-11-11  8:55 UTC (permalink / raw)
  To: Niklas Cassel, tj, hare; +Cc: linux-ide

On 11/9/22 08:15, Niklas Cassel wrote:
> While the ATA specification states that a device should return command
> aborted for all commands queued after the device has entered error state,
> since ATA only keeps the sense data for the latest command (in non-NCQ
> case), we really don't want to send block layer commands to the device
> after it has entered error state. (Only ATA EH commands should be sent,
> to read the sense data etc.)
> 
> Currently, scsi_queue_rq() will check if scsi_host_in_recovery()
> (state is SHOST_RECOVERY), and if so, it will _not_ issue a command via:
> scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd())
> -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue()
> 
> Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> when receiving a TFES error IRQ, the call chain looked like this:
> ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() ->
> ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() ->
> blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) ->
> scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY)
> 
> Which meant that as soon as an error IRQ was serviced, SHOST_RECOVERY
> would be set.
> 
> However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> scsi_times_out() will instead call scsi_abort_command() which will queue
> delayed work, and the worker function scmd_eh_abort_handler() will call
> scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY).
> 
> So now, after the TFES error IRQ has been serviced, we need to wait for
> the SCSI workqueue to run its work before SHOST_RECOVERY gets set.
> 
> It is worth noting that, even before commit e494f6a72839 ("[SCSI] improved
> eh timeout handler"), we could receive an error IRQ from the time when
> scsi_queue_rq() checks scsi_host_in_recovery(), to the time when
> ata_scsi_queuecmd() is actually called.
> 
> In order to handle both the delayed setting of SHOST_RECOVERY and the
> window where we can receive an error IRQ, add a check against
> ATA_PFLAG_EH_PENDING (which gets set when servicing the error IRQ),
> inside ata_scsi_queuecmd() itself, while holding the ap->lock.
> (Since the ap->lock is held while servicing IRQs.)
> 
> Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>

Applied to for-6.1-fixes. Thanks !

> ---
> Changes since v2:
> -Improve commit message and the comment inside the code.
> -Move the check to __ata_scsi_queuecmd().
> 
>  drivers/ata/libata-scsi.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 4cb914103382..93ebcdf2e354 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3978,9 +3978,19 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
>  
>  int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev)
>  {
> +	struct ata_port *ap = dev->link->ap;
>  	u8 scsi_op = scmd->cmnd[0];
>  	ata_xlat_func_t xlat_func;
>  
> +	/*
> +	 * scsi_queue_rq() will defer commands if scsi_host_in_recovery().
> +	 * However, this check is done without holding the ap->lock (a libata
> +	 * specific lock), so we can have received an error irq since then,
> +	 * therefore we must check if EH is pending, while holding ap->lock.
> +	 */
> +	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS))
> +		return SCSI_MLQUEUE_DEVICE_BUSY;
> +
>  	if (unlikely(!scmd->cmd_len))
>  		goto bad_cdb_len;
>  

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
  2022-11-11  8:55 ` Damien Le Moal
@ 2022-11-11 10:11   ` John Garry
  2022-11-12  1:41     ` Damien Le Moal
  0 siblings, 1 reply; 6+ messages in thread
From: John Garry @ 2022-11-11 10:11 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, tj, hare; +Cc: linux-ide

On 11/11/2022 08:55, Damien Le Moal wrote:
>> Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
>> Signed-off-by: Niklas Cassel<niklas.cassel@wdc.com>
> Applied to for-6.1-fixes. Thanks !
> 

If not too late, I tested this with that NCQ QEMU model again:

Tested-by: John Garry <john.g.garry@oracle.com>

Thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending
  2022-11-11 10:11   ` John Garry
@ 2022-11-12  1:41     ` Damien Le Moal
  0 siblings, 0 replies; 6+ messages in thread
From: Damien Le Moal @ 2022-11-12  1:41 UTC (permalink / raw)
  To: John Garry, Niklas Cassel, tj, hare; +Cc: linux-ide

On 11/11/22 19:11, John Garry wrote:
> On 11/11/2022 08:55, Damien Le Moal wrote:
>>> Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
>>> Signed-off-by: Niklas Cassel<niklas.cassel@wdc.com>
>> Applied to for-6.1-fixes. Thanks !
>>
> 
> If not too late, I tested this with that NCQ QEMU model again:
> 
> Tested-by: John Garry <john.g.garry@oracle.com>
> 
> Thanks!

Not too late. Thanks !

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-11-12  1:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-08 23:15 [PATCH v3] ata: libata-core: do not issue non-internal commands once EH is pending Niklas Cassel
2022-11-09  5:15 ` Damien Le Moal
2022-11-09  7:57   ` John Garry
2022-11-11  8:55 ` Damien Le Moal
2022-11-11 10:11   ` John Garry
2022-11-12  1:41     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox