Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Can Guo <cang@codeaurora.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: asutoshd@codeaurora.org, nguyenb@codeaurora.org,
	hongwus@codeaurora.org, rnayak@codeaurora.org,
	linux-scsi@vger.kernel.org, kernel-team@android.com,
	saravanak@google.com, salyzyn@google.com,
	Alim Akhtar <alim.akhtar@samsung.com>,
	Avri Altman <avri.altman@wdc.com>,
	"James E.J. Bottomley" <jejb@linux.ibm.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Stanley Chu <stanley.chu@mediatek.com>,
	Nitin Rawat <nitirawa@codeaurora.org>,
	Tomas Winkler <tomas.winkler@intel.com>,
	Bean Huo <beanhuo@micron.com>,
	Satya Tangirala <satyat@google.com>,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism
Date: Tue, 14 Jul 2020 17:13:16 +0800	[thread overview]
Message-ID: <5fb1e82c97a480e5330337a240a12633@codeaurora.org> (raw)
In-Reply-To: <47e7a4ec9a0404bc6d01818fcdad90eb@codeaurora.org>

Hi Bart,

On 2020-07-14 12:26, Can Guo wrote:
> Hi Bart,
> 
> On 2020-07-14 11:52, Bart Van Assche wrote:
>> On 2020-07-13 19:28, Can Guo wrote:
>>> o Queue eh_work on a single threaded workqueue to avoid concurrency 
>>> between
>>>   eh_works.
>> 
>> Please use another approach (mutex?) to serialize error handling. 
>> There are
>> already way too workqueues in a running Linux system.
>> 

Yeah, mutex works, but in this change, we need to flush the eh_work. As 
per
test, in real cases, flush_work can trigger warnings if the work is 
queued on
system_wq. Please check func check_flush_dependency().

>>> o According to the UFSHCI JEDEC spec, hibern8 enter/exit error occurs 
>>> when
>>>   the link is broken. This actaully applies to any power mode change
>>>   operations. In this change, if a power mode change operation 
>>> (including
>>>   AH8 enter/exit) fails, mark the link state as UIC_LINK_BROKEN_STATE 
>>> and
>>>   schedule eh_work. eh_work needs to do full reset and restore to 
>>> recover
>>>   the link back to active. Before the link state is recovered to 
>>> active by
>>>   eh_work, any power mode change attempts just return -ENOLINK to 
>>> avoid
>>>   consecutive HW error.
>>> 
>>> o To avoid concurrency between eh_work and link recovery, remove link
>>>   recovery from hibern8 enter/exit func. If hibern8 enter/exit func 
>>> fails,
>>>   simply return error code and let eh_work run in parallel.
>>> 
>>> o Recover UFS hba runtime PM error in eh_work. If 
>>> ufschd_suspend/resume
>>>   fails due to UFS error, e.g. hibern8 enter/exit error and SSU cmd 
>>> error,
>>>   the runtime PM framework saves the error to 
>>> dev.power.runtime_error.
>>>   After that, hba runtime suspend/resume would not be invoked anymore 
>>> until
>>>   dev.power.runtime_error is cleared. The runtime PM error can be 
>>> recovered
>>>   in eh_work by calling pm_runtime_set_active() after reset and 
>>> restore
>>>   succeeds. Meanwhile, if pm_runtime_set_active() returns no error, 
>>> which
>>>   means dev.power.runtime_error is cleared, we also need to 
>>> explicitly
>>>   resume those scsi devices under hba in case any of them has failed 
>>> to be
>>>   resumed due to hba runtime resume error.
>>> 
>>> o Fix a racing problem between eh_work and ufshcd_suspend/resume. In 
>>> the
>>>   old code, it blocks scsi requests before schedules eh_work, but 
>>> when
>>>   eh_work calls pm_runtime_get_sync(), if ufshcd_suspend/resume is 
>>> sending
>>>   a scsi cmd, most likely the SSU cmd, pm_runtime_get_sync() will 
>>> never
>>>   return because scsi requests were blocked. To fix this racing 
>>> problem,
>>>   o Don't block scsi requests before schedule eh_work, but let 
>>> eh_work
>>>     block scsi requests when eh_work is ready to start error 
>>> recovery.
>>>   o Meanwhile, if eh_work is schueduled due to fatal error, don't 
>>> requeue
>>>     the scsi cmds sent from ufshcd_suspend/resume path, but simply 
>>> let the
>>>     scsi cmds fail. If the scsi cmds fail, hba runtime suspend/resume 
>>> fails
>>>     too, but it does hurt since eh_work recovers hba runtime PM 
>>> error.
>>> 
>>> o Move host/regs dump in ufshcd_check_errors() to eh_work because 
>>> heavy
>>>   dump in IRQ context can lead to stability issues. In addition, some 
>>> clean
>>>   up in ufshcd_print_host_regs() and ufshcd_print_host_state().
>> 
>> The above list is a long list. To me that is a sign that this patch 
>> needs to
>> be split into multiple patches.
>> 
>> Thanks,
>> 
>> Bart.
> 
> Sure, will split it into a few patches.
> 
> Thanks,
> 
> Can Guo.

I tried, but I find it hard to split it as it works as a whole, it is a 
refactor
change rather than a mixture of multiple fixes. I will try to refine the 
commit
msg in next version. So it goes just as it is now.

Thanks,

Can Guo.

next prev parent reply	other threads:[~2020-07-14  9:13 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-14  2:28 [PATCH v2 0/4] Fix up and simplify error recovery mechanism Can Guo
2020-07-14  2:28 ` [PATCH v2 1/4] scsi: ufs: Add checks before setting clk-gating states Can Guo
2020-07-14  3:38   ` Bart Van Assche
2020-07-14  4:11     ` Can Guo
2020-07-14  2:28 ` [PATCH v2 2/4] scsi: ufs: Fix imbalanced scsi_block_reqs_cnt caused by ufshcd_hold() Can Guo
2020-07-14  3:41   ` Bart Van Assche
2020-07-14  4:11     ` Can Guo
2020-07-14  2:28 ` [PATCH v2 3/4] ufs: ufs-qcom: Fix a few BUGs in func ufs_qcom_dump_dbg_regs() Can Guo
2020-07-14  3:47   ` Bart Van Assche
2020-07-14  4:17     ` Can Guo
2020-07-14  2:28 ` [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism Can Guo
2020-07-14  3:52   ` Bart Van Assche
2020-07-14  4:26     ` Can Guo
2020-07-14  9:13       ` Can Guo [this message]
  -- strict thread matches above, loose matches on Subject: below --
2020-07-14 20:46 kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5fb1e82c97a480e5330337a240a12633@codeaurora.org \
    --to=cang@codeaurora.org \
    --cc=alim.akhtar@samsung.com \
    --cc=asutoshd@codeaurora.org \
    --cc=avri.altman@wdc.com \
    --cc=beanhuo@micron.com \
    --cc=bvanassche@acm.org \
    --cc=hongwus@codeaurora.org \
    --cc=jejb@linux.ibm.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=nguyenb@codeaurora.org \
    --cc=nitirawa@codeaurora.org \
    --cc=rnayak@codeaurora.org \
    --cc=salyzyn@google.com \
    --cc=saravanak@google.com \
    --cc=satyat@google.com \
    --cc=stanley.chu@mediatek.com \
    --cc=tomas.winkler@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.