From mboxrd@z Thu Jan 1 00:00:00 1970
From: krisman@linux.vnet.ibm.com (Gabriel Krisman Bertazi)
Date: Thu, 11 Aug 2016 15:10:35 -0300
Subject: Oops when completing request on the wrong queue
In-Reply-To: <20160811171655.GA5334@localhost.localdomain> (Keith Busch's
 message of "Thu, 11 Aug 2016 13:16:55 -0400")
References: <87a8gltgks.fsf@linux.vnet.ibm.com>
 <20160811171655.GA5334@localhost.localdomain>
Message-ID: <87d1lfrxb8.fsf@linux.vnet.ibm.com>

Keith Busch writes:

> On Wed, Aug 10, 2016 at 01:04:35AM -0300, Gabriel Krisman Bertazi wrote:
>> Hi,
>>
>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>> same time we add/remove processors. The Oops happens in the IRQ path,
>> when we try to complete a request that was apparently meant for another
>> queue.
>>
>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>> initiated the IO. Eventually, it happens that the request returned by
>> that function is not initialized, and we crash inside
>> __blk_mq_complete_request, as shown below.
>
> Could you try the following patch and see if it resolves the issue?

Hi Keith,

Thanks for your response. I had tried this exact change already on 4.7
with no effect. Do you think doing it on 4.8-rc1 will yield better
results?

I also verified that the iod, when in __nvme_process_cq, points to the
same queue that queued the command, as expected, but in nvme_timeout,
according to the log I sent earlier, it is pointing to a different nvmeq
(different nvmeq->qid). This is very strange to me.

-- 
Gabriel Krisman Bertazi