Expected behaviour for device hang

linux-nvme.lists.infradead.org archive mirror

* Expected behaviour for device hang
@ 2013-05-15 20:17 David.Darrington
  2013-05-15 20:57 ` Keith Busch
  0 siblings, 1 reply; 4+ messages in thread
From: David.Darrington @ 2013-05-15 20:17 UTC (permalink / raw)


What is the expected behaviour of the driver if a device hangs? If a
device stops processing commands, the commands will eventually time out,
which is handled in 'nvme_kthread' with a call to 'nvme_cancel_ios'.
However, this is not calling bio_completion. Every second the cycle
repeats, cancelling the same I/Os, and syslog fills up with the message
'Cancelling I/O xx'. I was expecting that the I/Os that time out would be
completed as failed and freed.

Is there something that is still TBD, or am I just missing something?

Thanks,
Dave

* Expected behaviour for device hang
  2013-05-15 20:17 Expected behaviour for device hang David.Darrington
@ 2013-05-15 20:57 ` Keith Busch
  2013-05-15 22:01   ` David.Darrington
  2013-05-17 13:40   ` Matthew Wilcox
  0 siblings, 2 replies; 4+ messages in thread
From: Keith Busch @ 2013-05-15 20:57 UTC (permalink / raw)

On Wed, 15 May 2013, David.Darrington@hgst.com wrote:
> What is the expected behaviour of the driver if a device hangs? If a
> device stops processing commands, the commands will eventually time out,
> which is handled in 'nvme_kthread' with a call to 'nvme_cancel_ios'.
> However, this is not calling bio_completion. Every second the cycle
> repeats, cancelling the same I/Os, and syslog fills up with the message
> 'Cancelling I/O xx'. I was expecting that the I/Os that time out would be
> completed as failed and freed.

bio_endio is called using the 'fn' callback after cancelling the command,
but the command id is not freed since the controller still technically
owns it.
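
A minimal user-space sketch of the behaviour described above, not the
driver code itself: a timed-out command has its completion callback
invoked once with an error, but its slot (command id) stays reserved
because the device may still own it. The names cmd_slot,
cancel_timed_out and count_failure are hypothetical, and the
"cancelled" flag is an assumption about how repeated cancellation of
the same command could be suppressed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 4

/* Hypothetical completion callback: status 0 is success, non-zero is failure. */
typedef void (*completion_fn)(void *ctx, int status);

struct cmd_slot {
    bool in_use;       /* command id is still reserved */
    bool cancelled;    /* failure already reported to the upper layer */
    long deadline;     /* tick at which the command counts as timed out */
    completion_fn fn;  /* upper-layer completion callback */
    void *ctx;         /* opaque argument passed back to fn */
};

/* Example callback: counts failed completions through an int counter. */
static void count_failure(void *ctx, int status)
{
    if (status != 0)
        (*(int *)ctx)++;
}

/*
 * Cancel every command whose deadline has passed. The callback runs
 * exactly once per command; the slot is NOT released, so the id cannot
 * be reused while the device might still complete it.
 * Returns the number of commands newly cancelled by this pass.
 */
static int cancel_timed_out(struct cmd_slot *slots, size_t n, long now)
{
    int newly_cancelled = 0;

    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        struct cmd_slot *s = &slots[i];

        if (!s->in_use || s->cancelled || now < s->deadline)
            continue;
        s->cancelled = true;        /* later passes skip this slot */
        if (s->fn)
            s->fn(s->ctx, -1);      /* report the I/O as failed */
        newly_cancelled++;
    }
    return newly_cancelled;
}
```

Calling cancel_timed_out() again on a later tick leaves an already
cancelled slot alone, so the same I/O is neither completed nor logged
twice, while in_use stays set until something else reclaims the id.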

As far as "Cancelling I/O" over and over, that should have been fixed
in this patch:

http://merlin.infradead.org/pipermail/linux-nvme/2013-April/000215.html

I thought that one was applied in the last merge, but looks like it was
missed. :(

> Is there something that is still TBD, or am I just missing something?

I think we may still have a problem, since ending the request releases the
mapped resources and the controller may still DMA to/from there. I have
another patch to just reset the controller when an I/O times out, but it is
pending on the power management set since it is basically the same thing.

* Expected behaviour for device hang
  2013-05-15 20:57 ` Keith Busch
@ 2013-05-15 22:01   ` David.Darrington
  2013-05-17 13:40   ` Matthew Wilcox
  1 sibling, 0 replies; 4+ messages in thread
From: David.Darrington @ 2013-05-15 22:01 UTC (permalink / raw)

Thanks Keith,

Is there a more targeted reset that we could try first after an I/O times
out? We are worried about the case when the device has multiple
namespaces. Depending on the H/W design, it's possible that I/O to one
namespace could hang while I/O to others continues to work. If we reset
the controller, that would disrupt the working namespaces. Maybe we can try
an abort first; if that times out, try to re-create the I/O queues
associated with the stuck namespace; and if that doesn't work, reset the
controller?
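
To make the proposal concrete, here is a rough user-space sketch of the
escalation order I have in mind (abort, then re-create queues, then
controller reset). Everything in it is hypothetical: recovery_ops,
recover_from_timeout and the step_ok/step_fail stubs are illustrative
names, not existing driver functions, and each step simply reports
whether it succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum recovery_result {
    RECOVERED_BY_ABORT,
    RECOVERED_BY_QUEUE_RECREATE,
    RECOVERED_BY_RESET,
    RECOVERY_FAILED
};

/* Hypothetical recovery step: returns true if the step cleared the hang. */
typedef bool (*recovery_fn)(void *dev);

struct recovery_ops {
    recovery_fn abort_cmd;         /* least disruptive */
    recovery_fn recreate_queues;   /* affects only the stuck queues */
    recovery_fn reset_controller;  /* disrupts every namespace */
};

/*
 * Try each step in order of increasing disruption and stop at the first
 * one that succeeds. A missing (NULL) step is skipped.
 */
static enum recovery_result
recover_from_timeout(const struct recovery_ops *ops, void *dev)
{
    assert(ops != NULL);
    if (ops->abort_cmd && ops->abort_cmd(dev))
        return RECOVERED_BY_ABORT;
    if (ops->recreate_queues && ops->recreate_queues(dev))
        return RECOVERED_BY_QUEUE_RECREATE;
    if (ops->reset_controller && ops->reset_controller(dev))
        return RECOVERED_BY_RESET;
    return RECOVERY_FAILED;
}

/* Stub steps for experimenting with the ordering. */
static bool step_ok(void *dev)   { (void)dev; return true; }
static bool step_fail(void *dev) { (void)dev; return false; }
```

With this ordering the working namespaces are only disturbed when both
of the narrower steps have already failed.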

Another question. The NVMe spec states that a subsystem reset may be
initiated by, among other things, 'a vendor specific event', and that the
CSTS.NSSRO bit indicates that a subsystem reset has happened. Are there
places where the driver should be checking this bit, perhaps after I/O
timeouts or other odd failures? If the bit is set, the driver could post
an error and begin recovery without worrying that other parts of the
device may still be working.
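
As an illustration of the kind of check I mean, here is a small sketch
that takes an already-read CSTS value and decides whether to skip the
targeted recovery. The bit positions are my reading of the CSTS layout
in the NVMe 1.1 spec (RDY bit 0, CFS bit 1, NSSRO bit 4) and should be
checked against the spec; the enum and function names are hypothetical,
and treating CFS the same way as NSSRO is my own assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CSTS bit masks; positions assumed from the NVMe 1.1 register layout. */
#define CSTS_RDY    (1u << 0)  /* controller ready */
#define CSTS_CFS    (1u << 1)  /* controller fatal status */
#define CSTS_NSSRO  (1u << 4)  /* NVM subsystem reset occurred */

static_assert(CSTS_NSSRO == 0x10, "NSSRO is expected at bit 4");

enum timeout_action {
    ACTION_TARGETED_RECOVERY,  /* other namespaces may still be working */
    ACTION_FULL_RECOVERY       /* whole device went through a reset or fault */
};

/* True if the subsystem-reset-occurred bit is set in a CSTS value. */
static bool subsystem_reset_occurred(uint32_t csts)
{
    return (csts & CSTS_NSSRO) != 0;
}

/*
 * Decide what to do after an I/O timeout from the CSTS value. If the
 * subsystem was reset (or the controller reports a fatal status), no
 * part of the device can be assumed to be working, so go straight to
 * full recovery.
 */
static enum timeout_action action_after_timeout(uint32_t csts)
{
    if (csts & (CSTS_NSSRO | CSTS_CFS))
        return ACTION_FULL_RECOVERY;
    return ACTION_TARGETED_RECOVERY;
}
```

For example, a CSTS value of 0x11 (RDY and NSSRO set) would select full
recovery, while 0x01 (RDY only) would leave room for a targeted attempt.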

* Expected behaviour for device hang
  2013-05-15 20:57 ` Keith Busch
  2013-05-15 22:01   ` David.Darrington
@ 2013-05-17 13:40   ` Matthew Wilcox
  1 sibling, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2013-05-17 13:40 UTC (permalink / raw)


On Wed, May 15, 2013 at 02:57:42PM -0600, Keith Busch wrote:
> As far as "Cancelling I/O" over and over, that should have been fixed
> in this patch:
> 
> http://merlin.infradead.org/pipermail/linux-nvme/2013-April/000215.html
> 
> I thought that one was applied in the last merge, but looks like it was
> missed. :(

I thought so too.  Sorry about that; applied now.
