From: "Dolev Raviv" <draviv@codeaurora.org>
To: ygardi@codeaurora.org
Cc: 'Hannes Reinecke' <hare@suse.de>,
	james.bottomley@hansenpartnership.com,
	linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-arm-msm@vger.kernel.org, santoshsy@gmail.com,
	linux-scsi-owner@vger.kernel.org,
	'Gilad Broner' <gbroner@codeaurora.org>,
	'Vinayak Holikatti' <vinholikatti@gmail.com>,
	"'James E.J. Bottomley'" <jbottomley@odin.com>,
	"'Martin K. Petersen'" <martin.petersen@oracle.com>
Subject: RE: [PATCH v5 03/15] scsi: ufs: implement scsi host timeout handler
Date: Tue, 8 Mar 2016 14:26:18 +0200
Message-ID: <001801d17935$ba9109b0$2fb31d10$@codeaurora.org>
In-Reply-To: <8f204a77c853df2c10aeff847f64f1c0.squirrel@us.codeaurora.org>

>> On 03/03/2016 05:10 PM, ygardi@codeaurora.org wrote:
>>>> On 03/01/2016 09:25 PM, ygardi@codeaurora.org wrote:
>>>>>> On 02/28/2016 09:32 PM, Yaniv Gardi wrote:
>>>>>>> A race condition exists between request requeueing and scsi 
>>>>>>> layer error handling:
>>>>>>> When UFS driver queuecommand returns a busy status for a 
>>>>>>> request, it will be requeued and its tag will be freed and set to -1.
>>>>>>> At the same time it is possible that the request will timeout 
>>>>>>> and scsi layer will start error handling for it. The scsi layer 
>>>>>>> reuses the request and its tag to send error related commands to 
>>>>>>> the device, however its tag is no longer valid.
>>>>>> Hmm. How can the host return a 'busy' status for a request?
>>>>>> From my understanding we have three possibilities:
>>>>>>
>>>>>> 1) queuecommand returns busy; however, that means that the 
>>>>>> command has never been sent and this issue shouldn't occur
>>>>>> 2) The command returns with BUSY status. But in this case it has 
>>>>>> already been returned, so there cannot be any timeout coming in.
>>>>>> 3) The host receives a command with a tag which is already in-use.
>>>>>> However, that should have been prevented by the block-layer, 
>>>>>> which really should ensure that this situation never happens.
>>>>>>
>>>>>> So either way I look at it, it really looks like a bug and adding 
>>>>>> a timeout handler will just paper over it.
>>>>>> (Not that a timeout handler is a bad idea, in fact I'm convinced 
>>>>>> that you need one. Just not for this purpose.)
>>>>>>
>>>>>> So can you elaborate how this 'busy' status comes about?
>>>>>> Is the command sent to the device?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hannes
>>>>>
>>>>>
>>>>> Hi Hannes,
>>>>>
>>>>> it's going to be a bit long :)
>>>>> I think you are missing the point.
>>>>> I will describe a race condition that happened to us a while ago 
>>>>> and was quite difficult to understand and fix.
>>>>> So, this patch is not about the "busy" returning to the scsi 
>>>>> dispatch routine. it's about the abort triggered after 30 seconds.
>>>>>
>>>>> imagine a request being queued and sent to the scsi, and then to 
>>>>> the ufs.
>>>>> a timer, initialized to 30 seconds, starts ticking.
>>>>> but the request is never sent to the ufs device, as queuecommand() 
>>>>> returns with "SCSI_MLQUEUE_HOST_BUSY"
>>>>> by looking at the code, this could happen, for example:
>>>>> 	err = ufshcd_hold(hba, true);
>>>>> 	if (err) {
>>>>> 		err = SCSI_MLQUEUE_HOST_BUSY;
>>>>> 		goto out;
>>>>> 	}
>>>>>
>>>> Uuhhh.
>>>> You probably should not have pointed me to that piece of code ...
>>>> open-coding loops in ufshcd_hold() ... shudder.
>>>> (Did I ever review that one? Must've ...)
>>>> _Anyway_: sleeping in queuecommand is always a bad idea, as then 
>>>> precisely those issues you've just described will happen.
>>>>
>>>> Couldn't you just call
>>>> ufshcd_hold(hba, false)
>>>> instead of
>>>> ufshcd_hold(hba, true)
>>>> ?
>>>> The request will be requeued more-or-less immediately, avoiding the 
>>>> issue with timeout handler kicking in.
>>>> And the queue will remain blocked until the ungate work item 
>>>> returns, at which point I/O submission will continue.
>>>> As the request will be requeued to the head of the queue there 
>>>> won't be other I/O competing with tags, so it shouldn't have any 
>>>> adverse effects.
>>>>
>>>> Wouldn't that work?
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>
>>> Hi Hannes
>>>
>>> This is a bug, and it should be fixed.
>> Oh, definitely agreed. The question is _where_.
>>
>>
>>> if you choose to bypass it by calling ufshcd_hold(hba, false), not 
>>> only is the race condition still there, able to pop up at any other 
>>> point in the future, but I am also not sure what the consequences 
>>> are of calling ufshcd_hold(hba, false) instead of "true".
>> Well ... seeing it's your driver, I would've thought _you_ should 
>> know ...
>>
>>> so, changing the already tested and working code (not to return 
>>> BUSY from queuecommand) is not a fix.
>> Hey, I did _not_ suggest not to return BUSY from queuecommand.
>>
>> I was suggesting this patch:
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c 
>> index 9c1b94b..b9295ad 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -1388,7 +1388,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>                 goto out;
>>         }
>>
>> -       err = ufshcd_hold(hba, true);
>> +       err = ufshcd_hold(hba, false);
>>         if (err) {
>>                 err = SCSI_MLQUEUE_HOST_BUSY;
>>                 clear_bit_unlock(tag, &hba->lrb_in_use);
>>
>> which, by reading the code, should be avoiding this issue.
>
>
> Hannes,
> we are not trying to avoid returning BUSY from queuecommand().
> On the contrary. By returning BUSY we are actually re-queuing the request 
> which is exactly what we need to do.
> your patch doesn't fix the race condition.
>
> thanks,
> Yaniv
>
>> I was just asking you if you could give this patch a spin and see if 
>> it works. If not (for whatever reason) I'm happy to accept your patch.
>> But first I would like to have an explanation why the above would 
>> _not_ work.
>>
>> Unfortunately I don't have the hardware otherwise I'd be running the 
>> tests myself.
>>
>> Cheers,
>>
>> Hannes
>> --
>> Dr. Hannes Reinecke		      zSeries & Storage
>> hare@suse.de			      +49 911 74053 688
>> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
>> GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>

I reviewed the patch; you can add:

Reviewed-by: Dolev Raviv <draviv@codeaurora.org>

Thanks,
Dolev
-- 
Qualcomm Israel, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project
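
For readers following the thread: the race in the quoted patch description,
and one way a host timeout handler could sidestep it, can be modelled in
plain userspace C. This is only a sketch under assumptions. Every name below
(model_host, model_cmd, model_issue, model_requeue_busy, model_timed_out) is
hypothetical and is not a kernel or ufshcd API, and the check shown is not
claimed to be what the patch under review does. The idea is that a timed-out
command which was never issued to the hardware gets its timer re-armed, so
error handling only starts for commands that actually own a slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define NUM_SLOTS 32

enum timer_verdict {
	VERDICT_RESET_TIMER,	/* never reached the host: re-arm the timer */
	VERDICT_START_EH,	/* outstanding on the host: run error handling */
};

struct model_cmd {
	int tag;		/* -1 once the request has been requeued */
};

struct model_host {
	unsigned long outstanding;		/* bit i set: slot i was issued */
	struct model_cmd *slot[NUM_SLOTS];	/* owner of each issued slot */
};

/* Record that cmd was actually handed to the (modelled) hardware. */
static void model_issue(struct model_host *h, struct model_cmd *cmd, int tag)
{
	assert(tag >= 0 && tag < NUM_SLOTS);
	cmd->tag = tag;
	h->slot[tag] = cmd;
	h->outstanding |= 1UL << tag;
}

/* Model queuecommand returning busy: request is requeued, tag is dropped. */
static void model_requeue_busy(struct model_cmd *cmd)
{
	cmd->tag = -1;
}

/*
 * Timeout decision: look the command up among the issued slots instead of
 * trusting cmd->tag, which may already be -1 because of a requeue.
 */
static enum timer_verdict model_timed_out(const struct model_host *h,
					  const struct model_cmd *cmd)
{
	for (int i = 0; i < NUM_SLOTS; i++) {
		if ((h->outstanding & (1UL << i)) && h->slot[i] == cmd)
			return VERDICT_START_EH;
	}
	return VERDICT_RESET_TIMER;
}
```

In this model a command that went through model_issue() yields
VERDICT_START_EH when it times out, while one that only ever saw
model_requeue_busy() yields VERDICT_RESET_TIMER, so its invalid tag is never
handed to error handling.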
