* Re: Oops when completing request on the wrong queue
[not found] <87a8gltgks.fsf@linux.vnet.ibm.com>
@ 2016-08-19 13:28 ` Gabriel Krisman Bertazi
2016-08-19 14:13 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-19 13:28 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Brian King,
linux-block, linux-scsi
Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
> We at IBM have been experiencing occasional Oopses when stressing IO
> while simultaneously adding and removing processors. The Oops happens in
> the IRQ path, when we try to complete a request that was apparently
> meant for another queue.
>
> In __nvme_process_cq, the driver uses the cqe.command_id and the
> nvmeq->tags to look up, via blk_mq_tag_to_rq, the request that
> initiated the IO. Occasionally, the request returned by that function
> is not initialized, and we crash inside __blk_mq_complete_request, as
> shown below.
>
> [ 2679.050701] Faulting instruction address: 0xc000000000549500
> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
> rpcsec_gss_krb5 auth_rpcgss
> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
> autofs4 btrfs xor
> raid6_pq ibmvscsi ibmveth nvme
> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted (4.4.0-22-generic)
> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28022428 XER: 00000000
> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0
> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00
> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001
> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68
> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50
> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002
> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff
> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00
> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050930] Call Trace:
> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
> hardware_interrupt_common+0x114/0x180
>
> I added some debugging to skip the Oops, and it made me think that we
> are fetching the wrong request because we are actually looking at the
> wrong set of tags.
>
> The log below illustrates what I am saying. Instead of crashing in the
> Oops, I made the execution skip the request completion and simply
> treat that odd CQE as handled. The first line of the log comes from
> that debug path.
>
>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>> nvme nvme0: I/O 309 QID 63 timeout, aborting
>> nvme nvme0: Abort status: 0x0
>> nvme nvme0: I/O 309 QID 63 timeout, reset controller
>> nvme nvme0: completing aborted command with status: fffffffc
>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>
> In the first line, we find request 309 on queue 35, which would have
> crashed the execution. My debug patch just ignores it.
>
> Then, we can see that the same IO, 309, eventually times out on
> another QID, which was actually waiting for it. The Abort request
> gets sent and completed, but we never receive the completion of the
> aborted request, so we time out again and reset the controller.
>
> This only happens when we are changing the SMT settings very
> frequently, which points back to the way we correlate hctx->tags with
> the nvmeq structure, but I failed to find the exact spot in the code
> where things get messed up.
>
> I've put some effort into this code lately, but I definitely need some
> assistance on this one. Since, as far as I understand, this issue is
> much more likely to reproduce on boxes with a lot of hotpluggable
> processors, I'm willing to test any patches you may find suitable on
> our systems.
>
Hi,
[Adding linux-block, linux-scsi and Jens in CC for blk-mq input]
Made some progress on tracking down the issue. It's looking more like a
blk-mq bug than an nvme one to me now. Let me know if I made any wrong
assumptions below:
I started to wonder whether blk-mq is dispatching the IO on the wrong
queue, or whether the request is being moved to another queue during
hotplug, as would happen in blk_mq_hctx_cpu_offline (see the sketch
below).
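For reference, the request-moving path I have in mind is roughly the
following. This is a paraphrased sketch, not the verbatim code;
declarations and locking details are trimmed, and dead_cpu just stands
for the CPU that is going away:

/* Paraphrased sketch of blk_mq_hctx_cpu_offline(); not verbatim. */
ctx = __blk_mq_get_ctx(q, dead_cpu);    /* software queue of the dead CPU */
spin_lock(&ctx->lock);
list_splice_init(&ctx->rq_list, &tmp);  /* steal its pending requests */
spin_unlock(&ctx->lock);

ctx = blk_mq_get_ctx(q);                /* ctx of the CPU running the notifier */
spin_lock(&ctx->lock);
while (!list_empty(&tmp)) {
        rq = list_first_entry(&tmp, struct request, queuelist);
        rq->mq_ctx = ctx;               /* requests are re-homed onto another ctx */
        list_move_tail(&rq->queuelist, &ctx->rq_list);
}
spin_unlock(&ctx->lock);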
I verified that when the failure occurs, the request was never touched
in blk_mq_hctx_cpu_offline. Also, and more interestingly, I added the
following BUG_ON to nvme_queue_rq:
BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);
From my experiments, I could never hit this during normal operation
(without SMT changes). But as soon as I start triggering hotplug
changes, I hit it pretty fast. By changing it to a WARN_ON, I see that
the Oops happens almost immediately after this request is submitted,
once __nvme_process_cq tries to complete it.
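For reference, the completion-side lookup that blows up is roughly the
following. This is a simplified sketch of the CQ handler on these
kernels, not the exact driver code, with the surrounding loop and error
handling omitted:

/* Simplified sketch of the lookup in __nvme_process_cq(); not verbatim. */
struct nvme_completion cqe = nvmeq->cqes[head];
struct request *req;

/*
 * blk_mq_tag_to_rq() is little more than "return tags->rqs[tag]", so if
 * nvmeq->tags points at the wrong tag set for this hardware queue, we
 * get back a request that was never started here, and
 * __blk_mq_complete_request() crashes on it.
 */
req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);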
Also, with the patch below applied, we send a few -EIOs back to the
upper layers, but it prevents the crash.
I also think I saw a similar hang when running on a virtio_scsi disk. I
couldn't reproduce it a second time, so we still lack confirmation, but
if confirmed, it indeed points away from nvme and back to the block
layer.
Any suggestions to further track this down?
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b927ecf..10f2257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -599,6 +599,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if (blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
+		ret = BLK_MQ_RQ_QUEUE_ERROR;
+		goto out;
+	}
+
 	ret = nvme_setup_cmd(ns, req, &cmnd);
 	if (ret)
 		goto out;
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-19 13:28 ` Oops when completing request on the wrong queue Gabriel Krisman Bertazi
@ 2016-08-19 14:13 ` Jens Axboe
2016-08-19 15:51 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-08-19 14:13 UTC (permalink / raw)
To: Gabriel Krisman Bertazi, Keith Busch
Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block,
linux-scsi
On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
> [...]
>
> Any suggestions to further track this down?
Some good detective work so far! I agree, this looks like a blk-mq core
bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
req->tag != nvmeq->tags? I don't immediately see how this could happen,
the freezing should protect us from this, unless it's broken somehow.
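For reference, by "the freezing" I mean the CPU notifier path, which is
supposed to quiesce every queue before the ctx map is rebuilt; roughly,
paraphrasing the current blk_mq_queue_reinit_notify() from memory (not
verbatim; online_new is the updated online mask the notifier computes):

/* Paraphrased sketch; error handling trimmed. */
list_for_each_entry(q, &all_q_list, all_q_node)
        blk_mq_freeze_queue_start(q);
list_for_each_entry(q, &all_q_list, all_q_node)
        blk_mq_freeze_queue_wait(q);            /* all in-flight requests drained */

list_for_each_entry(q, &all_q_list, all_q_node)
        blk_mq_queue_reinit(q, &online_new);    /* rebuild ctx -> hctx map */

list_for_each_entry(q, &all_q_list, all_q_node)
        blk_mq_unfreeze_queue(q);               /* only now can new requests enter */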
I'll take a look at this.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-19 14:13 ` Jens Axboe
@ 2016-08-19 15:51 ` Jens Axboe
2016-08-19 16:38 ` Gabriel Krisman Bertazi
0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-08-19 15:51 UTC (permalink / raw)
To: Gabriel Krisman Bertazi, Keith Busch
Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block,
linux-scsi
On 08/19/2016 08:13 AM, Jens Axboe wrote:
> On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
>> [...]
>>
>> Any suggestions to further track this down?
>
> Some good detective work so far! I agree, this looks like a blk-mq core
> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
> req->tag != nvmeq->tags? I don't immediately see how this could happen,
> the freezing should protect us from this, unless it's broken somehow.
> I'll take a look at this.
Can you share what you ran to online/offline CPUs? I can't reproduce
this here.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-19 15:51 ` Jens Axboe
@ 2016-08-19 16:38 ` Gabriel Krisman Bertazi
2016-08-23 20:54 ` Gabriel Krisman Bertazi
0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-19 16:38 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
Jens Axboe <axboe@kernel.dk> writes:
>> Some good detective work so far! I agree, this looks like a blk-mq core
>> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
>> req->tag != nvmeq->tags? I don't immediately see how this could happen,
>> the freezing should protect us from this, unless it's broken somehow.
>> I'll take a look at this.
There's not much in the BUG() stack trace, at least to me. We are
coming from an async execution of __blk_mq_run_hw_queue. This is from
4.7; I'll give it another run on 4.8-rc1.
[ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602!
cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0]
pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme]
lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme]
sp: c000007870d3ba20
msr: 900000010282b033
current = 0xc000007870d9e400
paca = 0xc000000002bd8800 softe: 0 irq_happened: 0x01
pid = 9205, comm = kworker/72:1H
kernel BUG at ../drivers/nvme/host/pci.c:602!
Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016
enter ? for help
[c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640
[c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660
[c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130
[c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c
48:mon>
Another interesting data point is that we always reach the WARN_ON in
__blk_mq_run_hw_queue immediately before submitting the bad request.
Maybe we are touching some percpu variable from the wrong cpu?
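That would fit with how the hw queue run gets punted to a CPU picked
from hctx->cpumask; roughly, paraphrasing the current
blk_mq_run_hw_queue() from memory (not verbatim):

/* Paraphrased sketch of the run path; not verbatim. */
if (!async && cpumask_test_cpu(smp_processor_id(), hctx->cpumask)) {
        __blk_mq_run_hw_queue(hctx);    /* run inline, we are on a mapped CPU */
} else {
        /*
         * The run is punted to kblockd on a CPU taken from hctx->cpumask.
         * If that mask (or hctx->next_cpu) is stale after a remap, the work
         * lands on a CPU the hctx no longer maps to, which is exactly what
         * the WARN_ON in __blk_mq_run_hw_queue() trips over.
         */
        kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
                                         &hctx->run_work, 0);
}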
See the WARN_ON trace below.
[ 369.953825] ------------[ cut here ]------------
[ 369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x640
[ 369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_
misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv
nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm
ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 ra
id456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng
[ 369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150
[ 369.955085] Workqueue: kblockd blk_mq_run_work_fn
[ 369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000
[ 369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000
[ 369.955314] REGS: c0000078cb027890 TRAP: 0700 Not tainted (4.7.0.mainline+)
[ 369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28002224 XER:
20000000
[ 369.955579] CFAR: c000000000569a18 SOFTE: 1
GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008
GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001
GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003
GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000
GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800
GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88
GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb
GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700
[ 369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640
[ 369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640
[ 369.956433] Call Trace:
[ 369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable)
[ 369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590
[ 369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660
[ 369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130
[ 369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c
[ 369.956791] Instruction dump:
[ 369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a
[ 369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060
[ 369.957020] ---[ end trace 1398dd60e3057065 ]---
[ 369.957094] ------------[ cut here ]------------
> Can you share what you ran to online/offline CPUs? I can't reproduce
> this here.
I was using the ppc64_cpu tool, which shouldn't do anything more than
write to sysfs, but I just reproduced it with the script below.

Note that this is ppc64le. I don't have an x86 box at hand to attempt to
reproduce right now, but I'll look for one and see how it goes.
#!/bin/bash

# Toggle every CPU whose index is not a multiple of 8 (cpu0, cpu8, ... stay online).
MAXCPUS=159
STATE=1

while true; do
        for i in $(seq 1 ${MAXCPUS}); do
                if (($i % 8)); then
                        echo $STATE > /sys/devices/system/cpu/cpu$i/online
                fi
        done
        if [[ $STATE -eq 1 ]]; then
                STATE=0
        else
                STATE=1
        fi
done
--
Gabriel Krisman Bertazi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-19 16:38 ` Gabriel Krisman Bertazi
@ 2016-08-23 20:54 ` Gabriel Krisman Bertazi
2016-08-23 21:11 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-23 20:54 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do anything more than
> write to sysfs, but I just reproduced it with the script below.
>
> Note that this is ppc64le. I don't have an x86 box at hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.
Hi,
Any luck on reproducing it? We were initially reproducing with a
proprietary stress test, but I tried a generated fio jobfile combined
with the SMT script I shared earlier and I could reproduce the crash
consistently in less than 10 minutes of execution. This was still on
ppc64le, though; I couldn't get my hands on an NVMe device on x86 yet.
The job file I used, as well as the smt.sh script, in case you want to
give it a try:
jobfile: http://krisman.be/k/nvmejob.fio
smt.sh: http://krisman.be/k/smt.sh
Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.
Let me share my progress from the last couple of days in the hope that
it rings a bell for you.
First, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism. If a request were escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1. Before
that, I had also tried keeping the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount. No luck; the BUG_ON was still
hit.
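For reference, the guard I am talking about is the one every submitter
takes before allocating a request; roughly, paraphrasing
blk_queue_enter() from memory (not verbatim; nowait is the caller's
flag, and this fragment assumes the surrounding function context):

/* Paraphrased sketch of blk_queue_enter(); not verbatim. */
while (true) {
        if (percpu_ref_tryget_live(&q->q_usage_counter))
                return 0;       /* queue not frozen: safe to allocate and queue */

        if (nowait)
                return -EBUSY;

        /* frozen or freezing: wait until freeze_depth drops back to 0 */
        wait_event(q->mq_freeze_wq,
                   !atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q));
        if (blk_queue_dying(q))
                return -ENODEV;
}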
Also, I don't see anything special about the request that reaches the
BUG_ON. It's a REQ_TYPE_FS request and, at least the last time I
reproduced it, it was a READ that came from the stress test task through
submit_bio. So nothing remarkable about it either, as far as I can see.
I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where
the ctx got remapped to another hctx. I'm still learning my way through
the cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't. Some preliminary tests don't suggest this is what's
at play, but I want to spend a little more time on this theory (maybe
for lack of better ideas :)
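For context, the mapping I am suspicious of boils down to a per-cpu
table that gets rebuilt on every hotplug event; roughly, paraphrased
from memory of the current code (not verbatim):

/* Paraphrased sketch of the ctx -> hctx mapping on these kernels. */
struct blk_mq_ctx *ctx = per_cpu_ptr(q->queue_ctx, cpu);
struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[q->mq_map[cpu]];

/*
 * blk_mq_map_swqueue() rebuilds q->mq_map[], hctx->cpumask and
 * hctx->ctxs[] after a remap. If a ctx ever ended up reachable from two
 * hctxs, or moved to a different hctx while its requests were still in
 * flight, we would see exactly this kind of wrong-queue completion.
 */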
On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify. Since the
CPU is dead, no more requests would be coming through its ctx. I think
we could force a queue run in blk_mq_queue_reinit_notify, before
remapping, which would cause the hctx to fetch the remaining requests
from that dead ctx (since it's not unmapped yet), along the lines of the
sketch below. This way, we could maintain a single hotplug notification
hook and simplify the hotplug path. I haven't written code for it yet,
but I'll see if I can come up with something and send it to the list.
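Something along these lines is what I have in mind; this is an untested,
hypothetical sketch of the idea only, not actual code, and it assumes it
runs inside the notifier before blk_mq_queue_reinit():

/* Hypothetical and untested: drain every hctx before the remap. */
list_for_each_entry(q, &all_q_list, all_q_node) {
        struct blk_mq_hw_ctx *hctx;
        unsigned int i;

        /* the dead CPU's ctx is still mapped here, so its rq_list gets pulled */
        queue_for_each_hw_ctx(q, hctx, i)
                blk_mq_run_hw_queue(hctx, false);
}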
--
Gabriel Krisman Bertazi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-23 20:54 ` Gabriel Krisman Bertazi
@ 2016-08-23 21:11 ` Jens Axboe
2016-08-23 21:14 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-08-23 21:11 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>
>> [...]
>
> Hi,
>
> Any luck on reproducing it? We were initially reproducing with a
> proprietary stress test, but I tried a generated fio jobfile combined
> with the SMT script I shared earlier and I could reproduce the crash
> consistently in less than 10 minutes of execution. This was still on
> ppc64le, though; I couldn't get my hands on an NVMe device on x86 yet.
Nope, I have not been able to reproduce it. How long do the CPU
offline/online actions take on ppc64? It's pretty slow on x86, which may
hide the issue. I took out the various printk's associated with bringing
a CPU off/online, as well as the IRQ breaking parts, but that didn't
help in reproducing it.
> The job file I used, as well as the smt.sh script, in case you want to
> give it a try:
>
> jobfile: http://krisman.be/k/nvmejob.fio
> smt.sh: http://krisman.be/k/smt.sh
>
> Still, the trigger seems to be consistently a heavy load of IO
> associated with CPU addition/removal.
My workload looks similar to yours, in that it's high depth and with a
lot of jobs to keep most CPUs loaded. My bash script is different than
yours, I'll try that and see if it helps here.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-23 21:11 ` Jens Axboe
@ 2016-08-23 21:14 ` Jens Axboe
2016-08-23 22:49 ` Keith Busch
2016-08-24 18:34 ` Jens Axboe
0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2016-08-23 21:14 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
On 08/23/2016 03:11 PM, Jens Axboe wrote:
> [...]
>
> My workload looks similar to yours, in that it's high depth and with a
> lot of jobs to keep most CPUs loaded. My bash script is different than
> yours, I'll try that and see if it helps here.
Actually, I take that back. You're not using O_DIRECT, hence all your
jobs are running at QD=1, not the 256 specified. That looks odd, but
I'll try, maybe it'll hit something different.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-23 21:14 ` Jens Axboe
@ 2016-08-23 22:49 ` Keith Busch
2016-08-24 18:34 ` Jens Axboe
1 sibling, 0 replies; 12+ messages in thread
From: Keith Busch @ 2016-08-23 22:49 UTC (permalink / raw)
To: Jens Axboe
Cc: Gabriel Krisman Bertazi, Christoph Hellwig, linux-nvme,
Brian King, linux-block, linux-scsi
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
> >My workload looks similar to yours, in that it's high depth and with a
> >lot of jobs to keep most CPUs loaded. My bash script is different than
> >yours, I'll try that and see if it helps here.
>
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.
I haven't recreated this either, but I think I can logically see why
this failure is happening.
I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple of reasons.

First, we can't release nvmeq->tags when an hctx exits, because that
nvmeq may be used by other namespaces that need to point to the
device's tag set.

The other reason is that blk-mq doesn't exit or init hardware contexts
when remapping for a CPU event, leaving the nvme driver unaware that a
hardware context now points to a different tag set.

So I think I see why this test would fail; I don't know about a fix yet.
Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
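For reference, that coupling is established once in nvme_init_hctx()
and never revisited afterwards; this is abridged from memory of the
current driver, so not verbatim:

/* Abridged sketch of nvme_init_hctx(); not verbatim. */
static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                          unsigned int hctx_idx)
{
        struct nvme_dev *dev = data;
        struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

        /* Set up only once; nothing re-runs this when blk-mq remaps CPUs. */
        if (!nvmeq->tags)
                nvmeq->tags = &dev->tagset.tags[hctx_idx];

        hctx->driver_data = nvmeq;
        return 0;
}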
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-23 21:14 ` Jens Axboe
2016-08-23 22:49 ` Keith Busch
@ 2016-08-24 18:34 ` Jens Axboe
2016-08-24 20:36 ` Jens Axboe
1 sibling, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-08-24 18:34 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
On 08/23/2016 03:14 PM, Jens Axboe wrote:
> [...]
>
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.
Can you try this patch? It's not perfect, but I'll be interested if it
makes a difference for you.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..41def54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	struct list_head *dptr;
 	int queued;
 
-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
+
 	hctx->run++;
 
 	/*
@@ -1075,15 +1075,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		 bool async)
+		bool async)
 {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1093,8 +1089,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 
 	if (run_queue)
 		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
 }
 
 static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1099,9 @@ static void blk_mq_insert_requests(struct request_queue *q,
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	/*
@@ -1125,14 +1114,12 @@ static void blk_mq_insert_requests(struct request_queue *q,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
 		__blk_mq_insert_req_list(hctx, ctx, rq, false);
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 
 	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1679,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	while (!list_empty(&tmp)) {
 		struct request *rq;
 
+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
 		rq = list_first_entry(&tmp, struct request, queuelist);
 		rq->mq_ctx = ctx;
 		list_move_tail(&rq->queuelist, &ctx->rq_list);
--
Jens Axboe
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-24 18:34 ` Jens Axboe
@ 2016-08-24 20:36 ` Jens Axboe
2016-08-29 18:06 ` Gabriel Krisman Bertazi
0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2016-08-24 20:36 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
On 08/24/2016 12:34 PM, Jens Axboe wrote:
> [...]
>
> Can you try this patch? It's not perfect, but I'll be interested if it
> makes a difference for you.
This one should handle the WARN_ON() for running the hw queue on the
wrong CPU as well.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..b21a9b9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,12 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	struct list_head *dptr;
 	int queued;
 
-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+		cpu_online(hctx->next_cpu));
+
 	hctx->run++;
 
 	/*
@@ -1075,15 +1082,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		 bool async)
+		bool async)
 {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1093,8 +1096,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 
 	if (run_queue)
 		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
 }
 
 static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1106,9 @@ static void blk_mq_insert_requests(struct request_queue *q,
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	/*
@@ -1125,14 +1121,12 @@ static void blk_mq_insert_requests(struct request_queue *q,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
 		__blk_mq_insert_req_list(hctx, ctx, rq, false);
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 
 	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1686,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	while (!list_empty(&tmp)) {
 		struct request *rq;
 
+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
 		rq = list_first_entry(&tmp, struct request, queuelist);
 		rq->mq_ctx = ctx;
 		list_move_tail(&rq->queuelist, &ctx->rq_list);
--
Jens Axboe
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-24 20:36 ` Jens Axboe
@ 2016-08-29 18:06 ` Gabriel Krisman Bertazi
2016-08-29 18:40 ` Jens Axboe
0 siblings, 1 reply; 12+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-29 18:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
Jens Axboe <axboe@kernel.dk> writes:
>> Can you try this patch? It's not perfect, but I'll be interested if it
>> makes a difference for you.
>
Hi Jens,
Sorry for the delay. I just got back to this and have been running your
patch on top of 4.8 without a crash for over 1 hour. I wanna give it
more time to make sure it's running properly, though.
Let me get back to you after a few more rounds of test.
> This one should handle the WARN_ON() for running the hw queue on the
> wrong CPU as well.
On the workaround you added to prevent WARN_ON, we surely need to
prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
place, right.. How do you feel about the following RFC? I know it's
not a complete fix, but it feels like a good improvement to me.
http://www.spinics.net/lists/linux-scsi/msg98608.html
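For context, the round-robin in question never looks at
cpu_online_mask; roughly, paraphrasing blk_mq_hctx_next_cpu() from
memory (not verbatim):

/* Paraphrased sketch of blk_mq_hctx_next_cpu(); not verbatim. */
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
        if (hctx->queue->nr_hw_queues == 1)
                return WORK_CPU_UNBOUND;

        if (--hctx->next_cpu_batch <= 0) {
                int cpu = hctx->next_cpu, next_cpu;

                /* no check here for whether next_cpu is still online */
                next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
                if (next_cpu >= nr_cpu_ids)
                        next_cpu = cpumask_first(hctx->cpumask);

                hctx->next_cpu = next_cpu;
                hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
                return cpu;
        }

        return hctx->next_cpu;
}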
--
Gabriel Krisman Bertazi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue
2016-08-29 18:06 ` Gabriel Krisman Bertazi
@ 2016-08-29 18:40 ` Jens Axboe
0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2016-08-29 18:40 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
linux-block, linux-scsi
On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>>> Can you try this patch? It's not perfect, but I'll be interested if it
>>> makes a difference for you.
>>
>
> Hi Jens,
>
> Sorry for the delay. I just got back to this and have been running your
> patch on top of 4.8 without a crash for over 1 hour. I wanna give it
> more time to make sure it's running properly, though.
>
> Let me get back to you after a few more rounds of test.
Thanks, sounds good. The patches have landed in mainline too.
>> This one should handle the WARN_ON() for running the hw queue on the
>> wrong CPU as well.
>
> On the workaround you added to prevent WARN_ON, we surely need to
> prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
> place, right.. How do you feel about the following RFC? I know it's
> not a complete fix, but it feels like a good improvement to me.
>
> http://www.spinics.net/lists/linux-scsi/msg98608.html
But we can't completely prevent it, and I don't think we have to. I just
don't want to trigger a warning for something that's a valid condition.
I want the warning to trigger if this happens without the CPU going
offline, since then it's indicative of a real bug in the mapping. Your
patch isn't going to prevent it either - it'll shrink the window, at the
expense of making blk_mq_hctx_next_cpu() more expensive. So I don't
think it's worthwhile.
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
[not found] <87a8gltgks.fsf@linux.vnet.ibm.com>
2016-08-19 13:28 ` Oops when completing request on the wrong queue Gabriel Krisman Bertazi
2016-08-19 14:13 ` Jens Axboe
2016-08-19 15:51 ` Jens Axboe
2016-08-19 16:38 ` Gabriel Krisman Bertazi
2016-08-23 20:54 ` Gabriel Krisman Bertazi
2016-08-23 21:11 ` Jens Axboe
2016-08-23 21:14 ` Jens Axboe
2016-08-23 22:49 ` Keith Busch
2016-08-24 18:34 ` Jens Axboe
2016-08-24 20:36 ` Jens Axboe
2016-08-29 18:06 ` Gabriel Krisman Bertazi
2016-08-29 18:40 ` Jens Axboe