* [PATCH] scsi-mq: fix hw queue hang caused by timeout
@ 2014-09-18 15:59 Ming Lei
2014-09-18 16:35 ` Christoph Hellwig
0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2014-09-18 15:59 UTC
To: James Bottomley, Christoph Hellwig
Cc: Jens Axboe, linux-scsi, linux-kernel, Ming Lei
If two or more requests time out, the dispatch queue is put into the
stopped state and is never recovered; this problem does not occur in
non-mq mode.
This patch tries to restart the stopped queue once the queue is no
longer busy, so that subsequent retries can proceed.
Basically, this patch keeps the same behavior as non-mq mode in this
situation.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
drivers/scsi/scsi_lib.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 3b92c39..dfbc028 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -142,6 +142,8 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
struct scsi_device *device = cmd->device;
struct request_queue *q = device->request_queue;
unsigned long flags;
+ bool restart = false;
+ bool blocked = !!atomic_read(&device->device_blocked);
SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
"Inserting command %p into mlqueue\n", cmd));
@@ -152,9 +154,14 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
* Decrement the counters, since these commands are no longer
* active on the host/device.
*/
- if (unbusy)
+ if (unbusy) {
scsi_device_unbusy(device);
+ /* need to restart hw queue if it was stopped */
+ if (!atomic_read(&device->device_busy) && blocked)
+ restart = true;
+ }
+
/*
* Requeue this command. It will go before all other commands
* that are already in the queue. Schedule requeue work under
@@ -164,6 +171,8 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
cmd->result = 0;
if (q->mq_ops) {
scsi_mq_requeue_cmd(cmd);
+ if (restart)
+ blk_mq_start_stopped_hw_queues(q, true);
return;
}
spin_lock_irqsave(q->queue_lock, flags);
--
1.7.9.5
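To make the patch's restart condition concrete, here is a minimal userspace sketch in C. It is a simplified, single-threaded model, not kernel code. The types model_device and model_hw_queue and the functions model_device_unbusy() and model_queue_insert() are hypothetical stand-ins for struct scsi_device, the blk-mq hw context, scsi_device_unbusy() and __scsi_queue_insert(). The model has no atomics or locking. It follows the order of operations in the patch: sample the blocked state, drop the busy count, and restart the stopped queue only once the device has gone idle.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, single-threaded stand-ins for the state the patch
 * touches. These are NOT the kernel's struct scsi_device or
 * struct blk_mq_hw_ctx; there are no atomics or locks here.
 */
struct model_device {
	int device_busy;    /* commands outstanding on the device */
	int device_blocked; /* nonzero while the device refuses new commands */
};

struct model_hw_queue {
	bool stopped;       /* stands in for BLK_MQ_S_STOPPED */
};

/* Stand-in for scsi_device_unbusy(): one fewer command in flight. */
static void model_device_unbusy(struct model_device *dev)
{
	assert(dev->device_busy > 0);
	dev->device_busy--;
}

/*
 * Mirrors the mq path the patch adds to __scsi_queue_insert():
 * sample "blocked" first, drop the busy count, and restart the
 * stopped hw queue only when no commands remain outstanding.
 * Returns true if the queue was restarted.
 */
static bool model_queue_insert(struct model_device *dev,
			       struct model_hw_queue *hctx, bool unbusy)
{
	bool restart = false;
	bool blocked = dev->device_blocked != 0;

	if (unbusy) {
		model_device_unbusy(dev);
		if (dev->device_busy == 0 && blocked)
			restart = true;
	}

	/* The command itself would be put back on the requeue list here. */

	if (restart)
		hctx->stopped = false; /* cf. blk_mq_start_stopped_hw_queues() */
	return restart;
}
```

Consider two timed-out commands on a blocked device. The first requeue leaves one command busy, so the queue stays stopped. The second requeue brings the busy count to zero and restarts the queue.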
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] scsi-mq: fix hw queue hang caused by timeout
From: Christoph Hellwig @ 2014-09-18 16:35 UTC
To: Ming Lei
Cc: James Bottomley, Jens Axboe, linux-scsi, linux-kernel,
Douglas Gilbert
On Thu, Sep 18, 2014 at 11:59:10PM +0800, Ming Lei wrote:
> If two or more requests time out, the dispatch queue is put into the
> stopped state and is never recovered; this problem does not occur in
> non-mq mode.
>
> This patch tries to restart the stopped queue once the queue is no
> longer busy, so that subsequent retries can proceed.
>
> Basically, this patch keeps the same behavior as non-mq mode in this
> situation.
This looks somewhat similar to the issues that Doug reported, and I remember
when he was last running into boot problems it was timeout related, too.
As far as the implementation is concerned I think the correct fix is
to clear the BLK_MQ_S_STOPPED queue flags in blk_mq_kick_requeue_list.
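Christoph's suggestion moves the restart into the block layer instead. The sketch below models that idea under simplifying assumptions. The names model_queue, model_hctx, model_run_hw_queue() and model_kick_requeue_list() are hypothetical. The stopped field stands in for BLK_MQ_S_STOPPED, and every hw queue drains one shared requeue list. This is not the actual blk_mq_kick_requeue_list() implementation. It only shows that once kicking the requeue list also clears the stopped state, requeued commands get dispatched without the SCSI layer tracking busy and blocked counts.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_HW_QUEUES 2

/* Hypothetical stand-ins for a blk-mq hw context and request queue. */
struct model_hctx {
	bool stopped;    /* stands in for BLK_MQ_S_STOPPED */
	int dispatched;  /* commands handed to the driver */
};

struct model_queue {
	struct model_hctx hctx[MODEL_NR_HW_QUEUES];
	int requeue_len; /* commands waiting on the (shared) requeue list */
};

/* Running a stopped hw queue dispatches nothing. */
static void model_run_hw_queue(struct model_queue *q, struct model_hctx *hctx)
{
	while (!hctx->stopped && q->requeue_len > 0) {
		q->requeue_len--;
		hctx->dispatched++;
	}
}

/*
 * Sketch of the suggested behaviour: the kick clears the stopped
 * state on every hw queue before running it, so requeued commands
 * are retried without the driver deciding when to restart.
 */
static void model_kick_requeue_list(struct model_queue *q)
{
	for (int i = 0; i < MODEL_NR_HW_QUEUES; i++) {
		q->hctx[i].stopped = false;
		model_run_hw_queue(q, &q->hctx[i]);
	}
}
```

In this model, the restart no longer depends on SCSI-specific state. Any driver that requeues requests and then kicks the list would therefore get the same recovery.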
* Re: [PATCH] scsi-mq: fix hw queue hang caused by timeout
From: Jens Axboe @ 2014-09-18 17:03 UTC
To: Christoph Hellwig, Ming Lei
Cc: James Bottomley, linux-scsi, linux-kernel, Douglas Gilbert
On 2014-09-18 10:35, Christoph Hellwig wrote:
> This looks somewhat similar to the issues that Doug reported, and I remember
> when he was last running into boot problems it was timeout related, too.
>
> As far as the implementation is concerned I think the correct fix is
> to clear the BLK_MQ_S_STOPPED queue flags in blk_mq_kick_requeue_list.
Since that's the kick part of the requeue, auto-starting the queue for
that makes a lot of sense. I say that's the way we go.
--
Jens Axboe
* Re: [PATCH] scsi-mq: fix hw queue hang caused by timeout
From: Ming Lei @ 2014-09-19 13:07 UTC
To: Jens Axboe
Cc: Christoph Hellwig, James Bottomley, Linux SCSI List,
Linux Kernel Mailing List, Douglas Gilbert
On Fri, Sep 19, 2014 at 1:03 AM, Jens Axboe <axboe@fb.com> wrote:
> On 2014-09-18 10:35, Christoph Hellwig wrote:
>>
>> This looks somewhat similar to the issues that Doug reported, and I remember
>> when he was last running into boot problems it was timeout related, too.
>>
>> As far as the implementation is concerned I think the correct fix is
>> to clear the BLK_MQ_S_STOPPED queue flags in blk_mq_kick_requeue_list.
>
>
> Since that's the kick part of the requeue, auto-starting the queue for that
> makes a lot of sense. I say that's the way we go.
Yeah, that looks better.
But it doesn't work with just that simple change; I need to
investigate further.
Thanks,
* Re: [PATCH] scsi-mq: fix hw queue hang caused by timeout
From: Ming Lei @ 2014-09-19 14:18 UTC
To: Jens Axboe
Cc: Christoph Hellwig, James Bottomley, Linux SCSI List,
Linux Kernel Mailing List, Douglas Gilbert
On Fri, Sep 19, 2014 at 9:07 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Fri, Sep 19, 2014 at 1:03 AM, Jens Axboe <axboe@fb.com> wrote:
>> On 2014-09-18 10:35, Christoph Hellwig wrote:
>>>
>>> This looks somewhat similar to the issues that Doug reported, and I remember
>>> when he was last running into boot problems it was timeout related, too.
>>>
>>> As far as the implementation is concerned I think the correct fix is
>>> to clear the BLK_MQ_S_STOPPED queue flags in blk_mq_kick_requeue_list.
>>
>>
>> Since that's the kick part of the requeue, auto-starting the queue for that
>> makes a lot of sense. I say that's the way we go.
>
> Yeah, that looks better.
>
> But it doesn't work after the simple change, and I need to
> investigate further.
It was caused by the timer miss; now it works.
Thanks,
* Re: [PATCH] scsi-mq: fix hw queue hang caused by timeout
From: Jens Axboe @ 2014-09-19 14:21 UTC
To: Ming Lei
Cc: Christoph Hellwig, James Bottomley, Linux SCSI List,
Linux Kernel Mailing List, Douglas Gilbert
On 09/19/2014 08:18 AM, Ming Lei wrote:
> On Fri, Sep 19, 2014 at 9:07 PM, Ming Lei <ming.lei@canonical.com> wrote:
>> On Fri, Sep 19, 2014 at 1:03 AM, Jens Axboe <axboe@fb.com> wrote:
>>> On 2014-09-18 10:35, Christoph Hellwig wrote:
>>>>
>>>> This looks somewhat similar to the issues that Doug reported, and I remember
>>>> when he was last running into boot problems it was timeout related, too.
>>>>
>>>> As far as the implementation is concerned I think the correct fix is
>>>> to clear the BLK_MQ_S_STOPPED queue flags in blk_mq_kick_requeue_list.
>>>
>>>
>>> Since that's the kick part of the requeue, auto-starting the queue for that
>>> makes a lot of sense. I say that's the way we go.
>>
>> Yeah, that looks better.
>>
>> But it doesn't work after the simple change, and I need to
>> investigate further.
>
> It is because of the timer miss, now it starts to work.
Excellent. I think most new issues should be fixed in for-linus for
inclusion in this round. It's much bigger than I hoped for this late in
the cycle, but lots of us have run a lot of testing, so that's not a
huge worry.
--
Jens Axboe