* Expected behaviour for device hang
From: David.Darrington @ 2013-05-15 20:17 UTC

What is the expected behaviour of the driver if a device hangs? If a device stops processing commands, the commands will eventually time out, which is handled in 'nvme_kthread' with a call to 'nvme_cancel_ios'. However, this is not calling bio_completion. Every second the cycle repeats, cancelling the same I/Os, and syslog fills up with the message 'Cancelling I/O xx'. I was expecting that the I/Os that time out would be completed as failed and freed.

Is there something that is still TBD, or am I just missing something?

Thanks,
Dave

* Expected behaviour for device hang
From: Keith Busch @ 2013-05-15 20:57 UTC

On Wed, 15 May 2013, David.Darrington@hgst.com wrote:
> What is the expected behaviour of the driver if a device hangs? If a
> device stops processing commands, the commands will eventually time out,
> which is handled in 'nvme_kthread' with a call to 'nvme_cancel_ios'.
> However, this is not calling bio_completion. Every second the cycle
> repeats, cancelling the same I/Os, and syslog fills up with the message
> 'Cancelling I/O xx'. I was expecting that the I/Os that time out would be
> completed as failed and freed.

bio_endio is called using the 'fn' callback after cancelling the command, but the command id is not freed since the controller still technically owns it.

As far as 'Cancelling I/O' over and over, that should have been fixed in this patch:

http://merlin.infradead.org/pipermail/linux-nvme/2013-April/000215.html

I thought that one was applied in the last merge, but looks like it was missed. :(

> Is there something that is still TBD, or am I just missing something?

I think we may still have a problem, since ending the request releases the mapped resources and the controller may still DMA to/from there. I have another patch to just reset the controller when an I/O times out, but it is pending on the power management set since it is basically the same thing.

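The behaviour described in this message (fail the I/O upward through its completion callback, keep the command id reserved, and do not cancel the same command again on the next pass) can be modelled in a few lines of user-space C. This is only a sketch and not the driver's code: `cmd_slot`, `cancel_timed_out`, `count_failed`, `QUEUE_DEPTH` and `STATUS_ABORTED` are hypothetical names made up for illustration, and the deadline comparison ignores counter wrap-around.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH    8
#define STATUS_ABORTED 0x7   /* hypothetical status passed to callbacks */

typedef void (*completion_fn)(void *ctx, int status);

/* One outstanding command id on a queue (illustrative model only). */
struct cmd_slot {
	bool in_use;            /* id is reserved; the device may still own it */
	bool cancelled;         /* host has already failed this command upward */
	unsigned long deadline; /* time after which the command has timed out */
	completion_fn fn;       /* completion callback, e.g. one ending the bio */
	void *ctx;              /* opaque argument for the callback */
};

/* Example callback: counts how many commands were completed as aborted. */
static void count_failed(void *ctx, int status)
{
	if (status == STATUS_ABORTED)
		(*(int *)ctx)++;
}

/*
 * Cancel each timed-out command exactly once.
 * Returns the number of commands cancelled on this pass.
 */
static int cancel_timed_out(struct cmd_slot *slots, size_t n, unsigned long now)
{
	int count = 0;

	for (size_t i = 0; i < n; i++) {
		struct cmd_slot *s = &slots[i];

		if (!s->in_use || now <= s->deadline)
			continue;       /* free slot, or not timed out yet */
		if (s->cancelled)
			continue;       /* already failed on an earlier pass */

		s->cancelled = true;
		s->fn(s->ctx, STATUS_ABORTED);
		count++;
		/* in_use stays true, so the id is not handed out again
		 * while the device might still complete the command. */
	}
	return count;
}
```

Calling `cancel_timed_out` once per polling interval on the same queue reports each hung command a single time rather than every second. The slot stays reserved until something outside this sketch, such as a controller reset, makes it safe to reuse.
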
* Expected behaviour for device hang
From: David.Darrington @ 2013-05-15 22:01 UTC

Thanks Keith,

Is there a more targeted reset that we could try first after an I/O times out? We are worried about the case when the device has multiple namespaces. Depending on the H/W design, it's possible that I/O to one namespace could hang while I/O to others continues to work. If we reset the controller, that would disrupt the working namespaces. Maybe we can try an abort first; if that times out, try to re-create the I/O queues associated with the stuck namespace; and if that doesn't work, reset the controller?

Another question. The NVMe spec states that a subsystem reset may be initiated by, among other things, 'a vendor specific event', and that the CSTS.NSSRO bit indicates that a subsystem reset has happened. Are there places where the driver should be checking this bit? Perhaps after I/O timeouts or other odd failures. If the bit is set, the driver could post an error and begin recovery without worrying that other parts of the device may still be working.

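The escalation proposed above (abort, then re-create the I/O queues, then reset the controller) together with the CSTS.NSSRO check can be sketched as a small ladder. This is an illustrative user-space model, not driver code: `recovery_ops`, `recover_from_timeout`, `subsystem_was_reset` and the two stub hooks are hypothetical names. The NSSRO position (bit 4 of CSTS) is taken from the NVMe 1.1 register layout, and skipping straight to a controller reset when it is set is one possible reading of the suggestion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CSTS.NSSRO: NVM Subsystem Reset Occurred, bit 4 of Controller Status. */
#define CSTS_NSSRO (1u << 4)

/* Recovery steps, ordered from least to most disruptive. */
enum recovery_step {
	STEP_ABORT,             /* abort only the timed-out command */
	STEP_RECREATE_QUEUES,   /* re-create the affected I/O queues */
	STEP_RESET_CONTROLLER,  /* disrupts every namespace */
	STEP_COUNT
};

/* Hypothetical hooks: each returns true if its recovery action worked. */
struct recovery_ops {
	bool (*try_step[STEP_COUNT])(void *dev);
};

/* Stub hooks, for illustration and testing only. */
static bool step_fails(void *dev)    { (void)dev; return false; }
static bool step_succeeds(void *dev) { (void)dev; return true; }

static bool subsystem_was_reset(uint32_t csts)
{
	return (csts & CSTS_NSSRO) != 0;
}

/*
 * Walk the ladder until one step succeeds.
 * Returns the step that succeeded, or -1 if every step failed.
 */
static int recover_from_timeout(const struct recovery_ops *ops, void *dev,
				uint32_t csts)
{
	/* Assumption: after a subsystem reset nothing on the device is
	 * still working, so the targeted steps are skipped. */
	int step = subsystem_was_reset(csts) ? STEP_RESET_CONTROLLER
					     : STEP_ABORT;

	for (; step < STEP_COUNT; step++) {
		if (ops->try_step[step] && ops->try_step[step](dev))
			return step;
	}
	return -1;
}
```

With an abort hook that fails and a queue re-creation hook that succeeds, `recover_from_timeout` stops at `STEP_RECREATE_QUEUES`, so namespaces served by other queues are never disturbed by a full reset.
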
* Expected behaviour for device hang
From: Matthew Wilcox @ 2013-05-17 13:40 UTC

On Wed, May 15, 2013 at 02:57:42PM -0600, Keith Busch wrote:
> As far as 'Cancelling I/O' over and over, that should have been fixed
> in this patch:
>
> http://merlin.infradead.org/pipermail/linux-nvme/2013-April/000215.html
>
> I thought that one was applied in the last merge, but looks like it was
> missed. :(

I thought so too. Sorry about that; applied now.