From: Can Guo <cang@codeaurora.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: asutoshd@codeaurora.org, nguyenb@codeaurora.org,
hongwus@codeaurora.org, rnayak@codeaurora.org,
linux-scsi@vger.kernel.org, kernel-team@android.com,
saravanak@google.com, salyzyn@google.com,
Alim Akhtar <alim.akhtar@samsung.com>,
Avri Altman <avri.altman@wdc.com>,
"James E.J. Bottomley" <jejb@linux.ibm.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
Stanley Chu <stanley.chu@mediatek.com>,
Nitin Rawat <nitirawa@codeaurora.org>,
Tomas Winkler <tomas.winkler@intel.com>,
Bean Huo <beanhuo@micron.com>,
Satya Tangirala <satyat@google.com>,
open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism
Date: Tue, 14 Jul 2020 17:13:16 +0800
Message-ID: <5fb1e82c97a480e5330337a240a12633@codeaurora.org>
In-Reply-To: <47e7a4ec9a0404bc6d01818fcdad90eb@codeaurora.org>
Hi Bart,
On 2020-07-14 12:26, Can Guo wrote:
> Hi Bart,
>
> On 2020-07-14 11:52, Bart Van Assche wrote:
>> On 2020-07-13 19:28, Can Guo wrote:
>>> o Queue eh_work on a single threaded workqueue to avoid concurrency
>>>   between eh_works.
>>
>> Please use another approach (mutex?) to serialize error handling. There
>> are already way too many workqueues in a running Linux system.
>>
Yeah, a mutex works, but in this change we need to flush eh_work. In
our tests on real use cases, flush_work() can trigger warnings if the
work is queued on system_wq. Please check check_flush_dependency().
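To illustrate the warning condition, here is a simplified user-space sketch of the check that check_flush_dependency() performs. It is a model written for this discussion, not the kernel source: the struct, the MODEL_* flag values and the function name flush_would_warn() are hypothetical, and only the shape of the condition is meant to match. system_wq is created without WQ_MEM_RECLAIM, so flushing work queued on it from a memory-reclaim context is what trips the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel workqueue flags (values arbitrary). */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)
#define MODEL_WQ_LEGACY      (1u << 1)

struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true if flushing work queued on 'target' would warn.
 * 'flusher' is the workqueue whose worker calls the flush, or NULL if
 * the caller is not a workqueue worker. 'in_reclaim' models a task
 * that is itself running with PF_MEMALLOC set.
 */
static bool flush_would_warn(const struct model_wq *target,
			     const struct model_wq *flusher,
			     bool in_reclaim)
{
	assert(target != NULL);

	/* A reclaim-safe target can be flushed from any context. */
	if (target->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	/* A task in memory reclaim must not wait on a non-reclaim wq. */
	if (in_reclaim)
		return true;

	/* Neither may a worker of a (non-legacy) reclaim workqueue. */
	if (flusher &&
	    (flusher->flags & (MODEL_WQ_MEM_RECLAIM | MODEL_WQ_LEGACY)) ==
	    MODEL_WQ_MEM_RECLAIM)
		return true;

	return false;
}
```

With this model, a worker of a WQ_MEM_RECLAIM workqueue that flushes work on a flag-less queue such as system_wq gets the warning, while flushing work on a dedicated queue that has WQ_MEM_RECLAIM set does not.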
>>> o According to the UFSHCI JEDEC spec, hibern8 enter/exit errors occur
>>>   when the link is broken. This actually applies to any power mode
>>>   change operation. In this change, if a power mode change operation
>>>   (including AH8 enter/exit) fails, mark the link state as
>>>   UIC_LINK_BROKEN_STATE and schedule eh_work. eh_work needs to do a
>>>   full reset and restore to recover the link back to active. Until
>>>   eh_work has recovered the link state to active, any power mode
>>>   change attempt just returns -ENOLINK to avoid consecutive HW errors.
>>>
>>> o To avoid concurrency between eh_work and link recovery, remove link
>>>   recovery from the hibern8 enter/exit functions. If a hibern8
>>>   enter/exit function fails, simply return the error code and let
>>>   eh_work run in parallel.
>>>
>>> o Recover UFS hba runtime PM errors in eh_work. If
>>>   ufshcd_suspend/resume fails due to a UFS error, e.g. a hibern8
>>>   enter/exit error or an SSU cmd error, the runtime PM framework saves
>>>   the error to dev.power.runtime_error. After that, hba runtime
>>>   suspend/resume is not invoked anymore until
>>>   dev.power.runtime_error is cleared. The runtime PM error can be
>>>   recovered in eh_work by calling pm_runtime_set_active() after reset
>>>   and restore succeeds. Meanwhile, if pm_runtime_set_active() returns
>>>   no error, which means dev.power.runtime_error is cleared, we also
>>>   need to explicitly resume the scsi devices under the hba in case any
>>>   of them failed to resume due to the hba runtime resume error.
>>>
>>> o Fix a race between eh_work and ufshcd_suspend/resume. The old code
>>>   blocks scsi requests before scheduling eh_work, but when eh_work
>>>   calls pm_runtime_get_sync(), if ufshcd_suspend/resume is sending a
>>>   scsi cmd, most likely the SSU cmd, pm_runtime_get_sync() will never
>>>   return because scsi requests were blocked. To fix this race,
>>>   o Don't block scsi requests before scheduling eh_work, but let
>>>     eh_work block scsi requests when it is ready to start error
>>>     recovery.
>>>   o Meanwhile, if eh_work is scheduled due to a fatal error, don't
>>>     requeue the scsi cmds sent from the ufshcd_suspend/resume path,
>>>     but simply let the scsi cmds fail. If the scsi cmds fail, hba
>>>     runtime suspend/resume fails too, but it doesn't hurt since
>>>     eh_work recovers the hba runtime PM error.
>>>
>>> o Move the host/regs dump in ufshcd_check_errors() to eh_work because
>>>   a heavy dump in IRQ context can lead to stability issues. In
>>>   addition, do some cleanup in ufshcd_print_host_regs() and
>>>   ufshcd_print_host_state().
>>
>> The above list is a long list. To me that is a sign that this patch
>> needs to be split into multiple patches.
>>
>> Thanks,
>>
>> Bart.
>
> Sure, will split it into a few patches.
>
> Thanks,
>
> Can Guo.
I tried, but I find it hard to split because it works as a whole: it is
a refactoring change rather than a mixture of multiple fixes. I will try
to refine the commit msg in the next version, so the patch stays as it
is now.
Thanks,
Can Guo.