Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests

From: Damien Le Moal <dlemoal@kernel.org>
To: Niklas Cassel <Niklas.Cassel@wdc.com>,
	kernel test robot <oliver.sang@intel.com>
Cc: "linan666@huaweicloud.com" <linan666@huaweicloud.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	"lkp@intel.com" <lkp@intel.com>, luojian <luojian5@huawei.com>,
	"linux-ide@vger.kernel.org" <linux-ide@vger.kernel.org>,
	"htejun@gmail.com" <htejun@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linan122@huawei.com" <linan122@huawei.com>,
	"yukuai3@huawei.com" <yukuai3@huawei.com>,
	"yi.zhang@huawei.com" <yi.zhang@huawei.com>,
	"houtao1@huawei.com" <houtao1@huawei.com>,
	"yangerkun@huawei.com" <yangerkun@huawei.com>
Subject: Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests
Date: Mon, 11 Sep 2023 16:16:55 +0900
Message-ID: <d88625ca-9bc7-cf33-2fa7-9e71d4153e7f@kernel.org>
In-Reply-To: <ZPo6fXqTbmwDyopr@x1-carbon>

On 9/8/23 06:02, Niklas Cassel wrote:
> On Thu, Sep 07, 2023 at 03:43:19PM +0800, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed "kernel_BUG_at_drivers/ata/libata-sff.c" on:
>>
>> commit: d3d099d5c2dd38db84abd96df39f9f0828c16b7b ("[PATCH v4] ata: libata-eh: Honor all EH scheduling requests")
>> url: https://github.com/intel-lab-lkp/linux/commits/linan666-huaweicloud-com/ata-libata-eh-Honor-all-EH-scheduling-requests/20230906-164907
>> base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4
>> patch link: https://lore.kernel.org/all/20230906084212.1016634-1-linan666@huaweicloud.com/
>> patch subject: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests
>>
>> in testcase: boot
>>
>> compiler: gcc-12
>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> Unfortunately the problem reported by the kernel test robot is very real.
> I could reproduce without too much effort in QEMU.
> 
> The problem is basically that we cannot simply perform a host_eh_scheduled--;
> in ata_std_end_eh().
> 
> ata_std_end_eh() is called at the end of ata_scsi_port_error_handler(),
> so it is called once every time ata_scsi_port_error_handler() is called.
> 
> However, ata_scsi_port_error_handler() will be called by SCSI EH each
> time SCSI wakes up.
> 
> SCSI EH will sleep as long as:
> if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
>                     shost->host_failed != scsi_host_busy(shost)) {
> 	schedule();
> 	continue;
> }
> 
> 
> The methods in libata which we use to trigger EH are:
> 
> 1) ata_std_sched_eh(), which calls scsi_schedule_eh(), which does
> host_eh_scheduled++;
> 
> 2) ata_qc_schedule_eh(), which will end up in scsi_timeout,
> which calls scsi_eh_scmd_add() which does:
> host_failed++;
> 
> 
> So before this patch, setting host_eh_scheduled = 0; in ata_std_end_eh()
> works because it only negates the EH scheduled by
> ata_std_sched_eh().
> 
> However, if we do host_eh_scheduled-- and the EH was triggered by
> ata_qc_schedule_eh(), then host_eh_scheduled will drop below 0,
> which will trigger SCSI EH to wake up again :)
> 
> We could do something like only decreasing host_eh_scheduled if it is > 0.
> The QCs added to EH using ata_qc_schedule_eh() will be handled by
> ata_eh_finish(), which will iterate over all QCs owned by EH, and will
> either fail or retry each QC. After scsi_error_handler() has finished
> the call to eh_strategy_handler() (ata_scsi_error()), it will unconditionally
> set host_failed to 0:
> https://github.com/torvalds/linux/blob/v6.5/drivers/scsi/scsi_error.c#L2331-L2337
> 
> So something like this on top of the patch in $subject:
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 2d5ecd68b7e0..9ab12d7f6d9f 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -952,7 +952,13 @@ EXPORT_SYMBOL_GPL(ata_std_sched_eh);
>   */
>  void ata_std_end_eh(struct ata_port *ap)
>  {
> -       ap->scsi_host->host_eh_scheduled--;
> +       struct Scsi_Host *host = ap->scsi_host;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(host->host_lock, flags);
> +       if (host->host_eh_scheduled > 0)
> +               host->host_eh_scheduled--;
> +       spin_unlock_irqrestore(host->host_lock, flags);
>  }
>  EXPORT_SYMBOL(ata_std_end_eh);
> 
> 
> With that incremental patch, I can no longer reproduce the crash reported
> by the kernel test robot in my QEMU setup.

I am not confident that playing with the host_eh_scheduled count is the right
approach. A better solution may be to change the timing of clearing
ATA_PFLAG_EH_PENDING. Right now, this is done on entry to
ata_scsi_port_error_handler(), unconditionally. So ata_eh_reset() should not
need to clear the flag again. If we remove that, then a new interrupt received
after ata_eh_thaw() and setting EH_PENDING would be caught by the retry loop in
ata_scsi_port_error_handler(), which would run ap->ops->error_handler(ap) again.

So let's try this fix instead:

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 159ba6ba19eb..d1d081aa0c95 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2807,7 +2807,6 @@ int ata_eh_reset(struct ata_link *link, int classify,
        memset(&link->eh_info, 0, sizeof(link->eh_info));
        if (slave)
                memset(&slave->eh_info, 0, sizeof(link->eh_info));
-       ap->pflags &= ~ATA_PFLAG_EH_PENDING;
        spin_unlock_irqrestore(link->ap->lock, flags);

        if (ata_port_is_frozen(ap))

Li,

Can you please test this?

> 
> 
> 
> It might be worth mentioning that the race window for the bug that the patch
> in $subject is fixing, should be much smaller after this patch is in:
> https://lore.kernel.org/linux-ide/20230907081710.4946-1-Chloe_Chen@asmedia.com.tw/
> 
> Li Nan, perhaps you could see if you can still reproduce your original
> problem with the patch from the ASMedia guys?


> 
> However, even with the ASMedia patch, it should still be theoretically
> possible to get an error irq after ata_eh_reset() has called ahci_thaw(),
> so I suppose that this patch still makes some sense...
> 
> 
> Kind regards,
> Niklas

-- 
Damien Le Moal
Western Digital Research

Thread overview: 7+ messages
2023-09-06  8:42 [PATCH v4] ata: libata-eh: Honor all EH scheduling requests linan666
2023-09-07  7:43 ` kernel test robot
2023-09-07 21:02   ` Niklas Cassel
2023-09-11  7:16     ` Damien Le Moal [this message]
2023-09-11 11:32       ` Li Nan
2023-09-11 12:40       ` Niklas Cassel
2023-09-11 11:26     ` Li Nan