Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

From: Mike Christie <michael.christie@oracle.com>
To: Martin Wilck <mwilck@suse.com>,
	john.g.garry@oracle.com, bvanassche@acm.org, hch@lst.de,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	james.bottomley@hansenpartnership.com
Subject: Re: [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors
Date: Mon, 18 Sep 2023 13:45:27 -0500	[thread overview]
Message-ID: <e35f738c-b6d6-4e86-aa29-875b3629a0b7@oracle.com> (raw)
In-Reply-To: <25a0b3bace532c5340196466f8a4194c9b8da473.camel@suse.com>

On 9/18/23 11:48 AM, Martin Wilck wrote:
> On Sun, 2023-09-17 at 19:35 -0500, Mike Christie wrote:
>> On 9/15/23 4:34 PM, Martin Wilck wrote:
>>> On Fri, 2023-09-15 at 22:21 +0200, Martin Wilck wrote:
>>>> On Tue, 2023-09-05 at 18:15 -0500, Mike Christie wrote:
>>>>> This has read_capacity_16 have scsi-ml retry errors instead of
>>>>> driving
>>>>> them itself.
>>>>>
>>>>> There are 2 behavior changes with this patch:
>>>>> 1. There is one behavior change where we no longer retry when
>>>>> scsi_execute_cmd returns < 0, but we should be ok. We don't
>>>>> need to
>>>>> retry
>>>>> for failures like the queue being removed, and for the case
>>>>> where
>>>>> there
>>>>> are no tags/reqs since the block layer waits/retries for us.
>>>>> For
>>>>> possible
>>>>> memory allocation failures from blk_rq_map_kern we use
>>>>> GFP_NOIO, so
>>>>> retrying will probably not help.
>>>>> 2. For the specific UAs we checked for and retried, we would
>>>>> get
>>>>> READ_CAPACITY_RETRIES_ON_RESET retries plus whatever retries
>>>>> were
>>>>> left
>>>>> from the loop's retries. Each UA now gets
>>>>> READ_CAPACITY_RETRIES_ON_RESET
>>>>> reties, and the other errors (not including medium not present)
>>>>> get
>>>>> up
>>>>> to 3 retries.
>>>>
>>>> This is ok, but - just a thought - would it make sense to add a
>>>> field
>>>> for maximum total retry count (summed over all failures)? That
>>>> would
>>>> allow us to minimize behavior changes also in other cases.
>>>
>>> Could we perhaps use scmd->allowed for this purpose?
>>>
>>> I noted that several callers of scsi_execute_cmd() in your patch
>>> set
>>> set the allowed parameter, e.g. to sdkp->max_retries in 07/34.
>>> But allowed doesn't seem to be used at all in the passthrough case,
>>
>> I think scmd->allowed is still used for errors that don't hit the
>> check_type goto in scsi_no_retry_cmd.
>>
>> If the user sets up scsi failures for only UAs, and we got a
>> DID_TRANSPORT_DISRUPTED then we we do scsi_cmd_retry_allowed which
>> checks scmd->allowed.
>>
>> In scsi_noretry_cmd we then hit the:
>>
>>         if (!scsi_status_is_check_condition(scmd->result))
>>                 return false;
>>
>> and will retry.
> 
> Right. But that's pretty confusing. Actually, I am confused about some
> other things as well. 
> 
> I apologize for taking a step back and asking some more questions and
> presenting some more thoughts about your patch set as a whole.
> 
> For passthrough commands, the simplified logic is now:
> 
> scsi_decide_disposition() 
> {
>          if (!scsi_device_online)
>                 return SUCCESS;
>          if ((rtn = scsi_check_passthrough(scmd)) != SCSI_RETURN_NOT_HANDLED)
>                 return rtn;
> 
>          /* handle host byte for regular commands, 
>             may return SUCESS, NEEDS_RETRY/ADD_TO_MLQUEUE, 
>             FAILED, fall through, or jump to maybe_retry */
> 
>          /* handle status byte for regular commands, likewise */
>           
>  maybe_retry: /* [2] */
>          if (scsi_cmd_retry_allowed(scmd) && !scsi_noretry_cmd(scmd))
>                 return NEEDS_RETRY;
>          else
>                 return SUCCESS;
> }
> 
> where scsi_noretry_cmd() has special treatment for passthrough commands
> in certain cases (DID_TIME_OUT and CHECK_CONDITION).
> 
> Because you put the call to scsi_check_passthrough() before the
> standard checks, some retries that the mid layer used to do will now
> not be done if scsi_check_passthrough() returns SUCCESS. Also,

Yeah, I did that on purpose to give scsi_execute_cmd more control over
whether to retry or not. I think you are looking at this more like
we want to be able to retry when scsi-ml decided not to.

For example, I was thinking multipath related code like the scsi_dh handlers
would want to fail for cases scsi-ml was currently retrying. Right now they
are setting REQ_FAILFAST_TRANSPORT but for most drivers that's not doing what
they think it does. Only DID_BUS_BUSY fast fails and I think the scsi_dh
failover code wants other errors like DID_TRANSPORT_DISRUPTED to fail so the
caller can decide based on something like the dm-multipath pg retries.

> depending on the list of failures passed with the command, we might
> miss the clauses in scsi_decide_disposition() where we previously
> returned FAILED (I haven't reviewed the current callers, but at least
> in theory it can happen). This will obviously change the runtime
> behavior, in ways I wouldn't bet can't be detrimental.

I think it's reverse of what you are thinking for the FAILED case
but I agree it's wrong (wrong but for different reasons).

If scsi_decide_disposition returns FAILED then runtime is bad, because
the scsi eh will start and then we have to wait for that.

The scsi_execute_cmd user can now actually bypass the EH for FAILED
so runtime will be shorter. However, I agree that it's wrong we can
bypass the EH because in some cases we need to kick the device or
cleanup cmds. So that should be fixed for sure and we should always
start the EH and go through that code path.

> 
> Before your patch set, the checks we now do in scsi_check_passthrough()
> were only performed in the case where the "regular command checks"
> returned SUCCESS. If we want as little behavior change as possible, the
> SUCCESS case is where we should plug in the call to
> scsi_check_passthrough(). Or am I missing something? [1]
> 
> This way we'd preserve the current semantics of "retries" and "allowed"
> which used to relate only to what were the "mid-layer retries" before
> your patch set.

It looks like we had different priorities. I was trying to allow
scsi_execute_cmd to be able to override scsi-ml retries, and not just allow
us to retry if scsi-ml decided not to.

Given I messed up on the FAILED case above, I agree doing the less
risky approach is better now. We can change it for multipath some other
day.

> 
>>> so we might as well use it as an upper limit for the total number of
>>> retries.
>>>
>>
>> If you really want an overall retry count let me know. I can just add
>> a total limit to the scsi_failures struct. You have been the only person
>> to ask about it and it didn't sound like you were sure if you wanted it
>> or not, so I haven't done it.
> 
> The question whether we want an additional limit for the total "failure
> retry" count is orthogonal to the above. My personal opinion is that we
> do, as almost all call sites that I've seen in your set currently use
> just a single retry count for all failures they handle.
> 
> I'm not sure whether the separate total retry count would have
> practical relevance. I think that in most practical cases we won't have
> a mix of standard "ML" retries and multiple different failure cases. I
> suppose that in any given situation, it's more likely that there's a
> single error condition which repeats.

I'm not sure what you are saying in the 2 paragraphs above.

We have:

1. scsi_execute_cmd users like read_capacity_10 which will retry the
device reset UA 10 times. Then it can retry that error 3 more time
(this was probably not intentional so can be ignored) and it can retry
any error other error except medium not present 3 times.

And then it calls scsi_execute_cmd with sdkp->max_retries so the scsi-ml
retried are whatever the user requested and 5 by default.

I think we wanted to keep this behavior.

2.  Then, for the initial device setup/discovery tests where the transport is
flakey during device discovery/setup, we can hit the DID_TRANSPORT_DISRUPTED
that multiple times, then we will get UAs after we relogin/reset the
transport/device. So I think we want to keep the behavior where a user
does

+       struct scsi_failure failures[] = {
+               {
+                       .sense = UNIT_ATTENTION,
+                       .asc = SCMD_FAILURE_ASC_ANY,
+                       .ascq = SCMD_FAILURE_ASCQ_ANY,
+                       .allowed = UA_RETRY_COUNT,
+                       .result = SAM_STAT_CHECK_CONDITION,
+               },
+               {}
+       };

scsi_execute_cmd(.... $NORMAL_CMD_RETRIES)

then we can retry transport errors NORMAL_CMD_RETRIES then once those
settle still retry UAs UA_RETRY_COUNT times.

3. Then we have the cases where the scsi_execute_cmd user retried
multiple errors and in this patchset we used to retry a total of N
times, but now can retry each error up to N times. For this it sounds
like you want to code it so we only do a total of N times so the
behavior is the same as before.

To handle all these I think I need the extra allowed field on the
scsi_failures.

next prev parent reply	other threads:[~2023-09-18 18:45 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-05 23:15 scsi: Allow scsi_execute users to control retries Mike Christie
2023-09-05 23:15 ` [PATCH v11 01/34] scsi: Add helper to prep sense during error handling Mike Christie
2023-09-05 23:15 ` [PATCH v11 02/34] scsi: Allow passthrough to override what errors to retry Mike Christie
2023-09-15 20:08   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 03/34] scsi: Add scsi_failure field to scsi_exec_args Mike Christie
2023-09-05 23:15 ` [PATCH v11 04/34] scsi: Have scsi-ml retry scsi_probe_lun errors Mike Christie
2023-09-15 20:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 05/34] scsi: retry INQUIRY after timeout Mike Christie
2023-09-15 20:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 06/34] scsi: sd: Fix sshdr use in read_capacity_16 Mike Christie
2023-09-15 20:13   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 07/34] scsi: sd: Have scsi-ml retry read_capacity_16 errors Mike Christie
2023-09-15 20:21   ` Martin Wilck
2023-09-15 21:34     ` Martin Wilck
2023-09-18  0:35       ` Mike Christie
2023-09-18 16:48         ` Martin Wilck
2023-09-18 18:45           ` Mike Christie [this message]
2023-09-19  9:07             ` Martin Wilck
2023-09-19 18:02               ` Mike Christie
2023-09-05 23:15 ` [PATCH v11 08/34] scsi: Use separate buf for START_STOP in sd_spinup_disk Mike Christie
2023-09-15 20:26   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 09/34] scsi: sd: Fix sshdr use " Mike Christie
2023-09-15 20:27   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 10/34] scsi: Have scsi-ml retry sd_spinup_disk errors Mike Christie
2023-09-15 20:46   ` Martin Wilck
2023-09-15 20:58     ` Mike Christie
2023-09-15 21:23       ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 11/34] scsi: hp_sw: Only access sshdr if res > 0 Mike Christie
2023-09-15 20:48   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 12/34] scsi: hp_sw: Have scsi-ml retry scsi_exec_req errors Mike Christie
2023-09-15 20:51   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 13/34] scsi: rdac: Fix send_mode_select retry handling Mike Christie
2023-09-15 20:54   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 14/34] scsi: rdac: Fix sshdr use Mike Christie
2023-09-15 20:55   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 15/34] scsi: rdac: Have scsi-ml retry send_mode_select errors Mike Christie
2023-09-15 20:58   ` Martin Wilck
2023-09-15 21:52     ` Mike Christie
2023-09-05 23:15 ` [PATCH v11 16/34] scsi: spi: Fix sshdr use Mike Christie
2023-09-15 20:59   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 17/34] scsi: spi: Have scsi-ml retry spi_execute errors Mike Christie
2023-09-15 21:00   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 18/34] scsi: sd: Fix sshdr use in sd_suspend_common Mike Christie
2023-09-15 21:04   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 19/34] scsi: sd: Have scsi-ml retry sd_sync_cache errors Mike Christie
2023-09-15 21:10   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 20/34] scsi: ch: Remove unit_attention Mike Christie
2023-09-15 21:08   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 21/34] scsi: ch: Have scsi-ml retry ch_do_scsi errors Mike Christie
2023-09-15 21:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 22/34] scsi: Have scsi-ml retry scsi_mode_sense UAs Mike Christie
2023-09-15 21:11   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 23/34] scsi: sd: Fix scsi_mode_sense caller's sshdr use Mike Christie
2023-09-15 21:13   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 24/34] scsi: Have scsi-ml retry scsi_report_lun_scan errors Mike Christie
2023-09-15 21:17   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 25/34] scsi: sd: Have pr commands retry UAs Mike Christie
2023-09-15 21:18   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 26/34] scsi: sd: Have scsi-ml retry read_capacity_10 errors Mike Christie
2023-09-15 21:25   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 27/34] scsi: ses: Have scsi-ml retry scsi_exec_req errors Mike Christie
2023-09-15 21:34   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 28/34] scsi: sr: Have scsi-ml retry get_sectorsize errors Mike Christie
2023-09-15 21:36   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 29/34] scsi: ufs: Have scsi-ml retry start stop errors Mike Christie
2023-09-15 21:37   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 30/34] scsi: Fix sshdr use in scsi_test_unit_ready Mike Christie
2023-09-15 21:38   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 31/34] scsi: Fix sshdr use in scsi_cdl_enable Mike Christie
2023-09-15 21:39   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 32/34] scsi: sd: Fix sshdr use in cache_type_store Mike Christie
2023-09-15 21:44   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 33/34] scsi: sr: Fix sshdr use in sr_get_events Mike Christie
2023-09-15 21:44   ` Martin Wilck
2023-09-05 23:15 ` [PATCH v11 34/34] scsi: Add kunit tests for scsi_check_passthrough Mike Christie
2023-09-15 21:52   ` Martin Wilck
2023-09-15 22:07     ` Mike Christie

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e35f738c-b6d6-4e86-aa29-875b3629a0b7@oracle.com \
    --to=michael.christie@oracle.com \
    --cc=bvanassche@acm.org \
    --cc=hch@lst.de \
    --cc=james.bottomley@hansenpartnership.com \
    --cc=john.g.garry@oracle.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=mwilck@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox