From mboxrd@z Thu Jan 1 00:00:00 1970 From: keith.busch@intel.com (Keith Busch) Date: Fri, 20 Dec 2013 12:41:08 -0700 (MST) Subject: [PATCH] NVMe: Use error handling on failed sync commands In-Reply-To: <20131220184710.GB4945@linux.intel.com> References: <1387563249-27871-1-git-send-email-keith.busch@intel.com> <20131220184710.GB4945@linux.intel.com> Message-ID: On Fri, 20 Dec 2013, Matthew Wilcox wrote: > On Fri, Dec 20, 2013@11:14:09AM -0700, Keith Busch wrote: >> Sync commands schedule an internal timeout to cancel rather than using >> the nvme timeout handler kthread. We should still try to recover so >> moving the check for cancelled commands after the error handling. > >> if (timeout && !time_after(now, info[cmdid].timeout)) >> continue; >> - if (info[cmdid].ctx == CMD_CTX_CANCELLED) >> - continue; >> if (timeout && nvmeq->dev->initialized) { >> nvme_abort_cmd(cmdid, nvmeq); >> continue; >> } >> + if (info[cmdid].ctx == CMD_CTX_CANCELLED) >> + continue; >> dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid, >> nvmeq->qid); >> ctx = cancel_cmdid(nvmeq, cmdid, &fn); > > I'm confused by this patch. Won't it cause us to send abort commands > repeatedly for commands IDs that have already been cancelled, but haven't > yet been completed as cancelled? It appears so, but 'nvme_abort_cmd' aborts only once then resets the controller if the command still isn't returned. The drive is not polled for timeouts when the reset handler is running so it won't timeout again, and the command is forced cancelled during reset with cancel_ios() 'timeout' set to false. 'sync_command' is the only place an IO can be cancelled while the device being polled for timeouts. I think we want to try recovering the unresponsive controller even if the sync timeout already cancelled. Without this patch, a failed sync command just leaks a cmdid until the controller is reset some other way.