* Re: Oops when completing request on the wrong queue [not found] <87a8gltgks.fsf@linux.vnet.ibm.com> @ 2016-08-19 13:28 ` Gabriel Krisman Bertazi 2016-08-19 14:13 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Gabriel Krisman Bertazi @ 2016-08-19 13:28 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Brian King, linux-block, linux-scsi Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: > We, IBM, have been experiencing eventual Oops when stressing IO at the > same time we add/remove processors. The Oops happens in the IRQ path, > when we try to complete a request that was apparently meant for another > queue. > > In __nvme_process_cq, the driver will use the cqe.command_id and the > nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that > initiated the IO. Eventually, it happens that the request returned by > that function is not initialized, and we crash inside > __blk_mq_complete_request, as shown below. > > [ 2679.050701] Faulting instruction address: 0xc000000000549500 > [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1] > [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries > [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c > rpcsec_gss_krb5 auth_rpcgss > nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc > autofs4 btrfs xor > raid6_pq ibmvscsi ibmveth nvme > [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu > [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000 > [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680 > [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted (4.4.0-22-generic) > [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28022428 XER: 00000000 > [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 > GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 > GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 > GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 > GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 > GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 > GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 > GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff > GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 > [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0 > [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] > [ 2679.050930] Call Trace: > [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable) > [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] > [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme] > [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320 > [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0 > [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0 > [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80 > [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0 > [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24 > [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140 > [ 
2679.051054] [c000000451bc33a0] [c000000000002594] > hardware_interrupt_common+0x114/0x180 > > I added some debugging which skipped the Oops and made me think that we > are fetching the wrong request because we are actually looking at the > wrong set of tags. > > The log below exemplifies what I am saying. Instead of crashing in the > Oops, I made the execution skip the request completion and simply > consider that odd CQE as handled. The first line of the log does that. > >> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q >> nvme nvme0: I/O 309 QID 63 timeout, aborting >> nvme nvme0: Abort status: 0x0 >> nvme nvme0: I/O 309 QID 63 timeout, reset controller >> nvme nvme0: completing aborted command with status: fffffffc >> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass >> blk_update_request: I/O error, dev nvme0n1, sector 2105537536 > > In the first line, we found the request 309 for queue 35, which would > have crashed the execution. My debug patch just ignores it. > > Then, we can see that eventually, the same IO, 309, will timeout in > another QID, which was actually expecting for it. The Abort request > gets sent and completed, but we never receive the completion of the > aborted request, thus we timeout again and restart the device. > > This only happens when we are changing the SMT settings very > frequently, which points back to the way we correlate the hctx->tags to > the nvmeq structure, but I failed to find the exact code spot were > things get messed up. > > I've put a some effort on this code lately, but I definitely need some > assistance on this one. Since, as far as I understand, this issue is > much more likely to reproduce on boxes with a lot of hotpluggable > processors, I'm willing to test any patches you may find suitable in our > systems. > Hi, [Adding linux-block, linux-scsi and Jens in CC for blk-mq input] Made some progress on tracking down the issue. It's looking more of a blk-mq bug than nvme to me now. Let me know if I made any wrong assumption below: I started to wonder whether blk-mq is requesting the IO in a wrong queue or if the request is being moved to another queue during hotplug, as it would happen in blk_mq_hctx_cpu_offline. I verified that when the fail occurs, the request never was never touched in blk_mq_hctx_cpu_offline. Also, and more interesting, I added the following BUG_ON to nvme_queue_req: BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req); >From my experiments, I could never hit this during normal operation (without SMT). But, as soon as I start triggering hotplug changes, I hit it pretty fast. By changing to WARN_ON, I see that the Oops will happen almost immediately after this request is submitted, once __process_cq tries to complete it. Also, when adding the patch below, we send a few -EIOs back to upper layers, but it prevents the crash. I also think i saw a similar hang when running on a virtio_scsi disk. I couldn't reproduce it a second time, so we still lack confirmation, but if confirmed, it indeed points away from nvme and back to the block layer. Any suggestions to further track this down? 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b927ecf..10f2257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -599,6 +599,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (ret)
 		return ret;
 
+	if(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
+		ret = BLK_MQ_RQ_QUEUE_ERROR;
+		goto out;
+	}
+
 	ret = nvme_setup_cmd(ns, req, &cmnd);
 	if (ret)
 		goto out;

^ permalink raw reply related	[flat|nested] 12+ messages in thread
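
For readers following the analysis above: the lookup at the centre of this crash, blk_mq_tag_to_rq(), is essentially a bare array index into a per-hardware-queue tag set, so handing it a tag that was allocated against a different hardware queue does not fail — it silently returns whatever request (possibly never initialised) sits in that slot of the wrong tags array. A simplified sketch of the 4.7-era helper, from memory rather than a verbatim copy of the block layer:

	/*
	 * Sketch of the lookup used by __nvme_process_cq(): a plain array
	 * index, with no check that the tag was ever allocated from *this*
	 * tag set.
	 */
	struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
	{
		if (tag < tags->nr_tags)
			return tags->rqs[tag];

		return NULL;
	}

That is why the mismatch only blows up later, once __blk_mq_complete_request() dereferences the bogus request.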
* Re: Oops when completing request on the wrong queue 2016-08-19 13:28 ` Oops when completing request on the wrong queue Gabriel Krisman Bertazi @ 2016-08-19 14:13 ` Jens Axboe 2016-08-19 15:51 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2016-08-19 14:13 UTC (permalink / raw) To: Gabriel Krisman Bertazi, Keith Busch Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote: > Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: > >> We, IBM, have been experiencing eventual Oops when stressing IO at the >> same time we add/remove processors. The Oops happens in the IRQ path, >> when we try to complete a request that was apparently meant for another >> queue. >> >> In __nvme_process_cq, the driver will use the cqe.command_id and the >> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that >> initiated the IO. Eventually, it happens that the request returned by >> that function is not initialized, and we crash inside >> __blk_mq_complete_request, as shown below. >> >> [ 2679.050701] Faulting instruction address: 0xc000000000549500 >> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1] >> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries >> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c >> rpcsec_gss_krb5 auth_rpcgss >> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc >> autofs4 btrfs xor >> raid6_pq ibmvscsi ibmveth nvme >> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu >> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000 >> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680 >> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted (4.4.0-22-generic) >> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28022428 XER: 00000000 >> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 >> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 >> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 >> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 >> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 >> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 >> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 >> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff >> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 >> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0 >> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >> [ 2679.050930] Call Trace: >> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable) >> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme] >> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320 >> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0 >> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0 >> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80 >> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0 >> [ 2679.051020] 
[c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24 >> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140 >> [ 2679.051054] [c000000451bc33a0] [c000000000002594] >> hardware_interrupt_common+0x114/0x180 >> >> I added some debugging which skipped the Oops and made me think that we >> are fetching the wrong request because we are actually looking at the >> wrong set of tags. >> >> The log below exemplifies what I am saying. Instead of crashing in the >> Oops, I made the execution skip the request completion and simply >> consider that odd CQE as handled. The first line of the log does that. >> >>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q >>> nvme nvme0: I/O 309 QID 63 timeout, aborting >>> nvme nvme0: Abort status: 0x0 >>> nvme nvme0: I/O 309 QID 63 timeout, reset controller >>> nvme nvme0: completing aborted command with status: fffffffc >>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass >>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536 >> >> In the first line, we found the request 309 for queue 35, which would >> have crashed the execution. My debug patch just ignores it. >> >> Then, we can see that eventually, the same IO, 309, will timeout in >> another QID, which was actually expecting for it. The Abort request >> gets sent and completed, but we never receive the completion of the >> aborted request, thus we timeout again and restart the device. >> >> This only happens when we are changing the SMT settings very >> frequently, which points back to the way we correlate the hctx->tags to >> the nvmeq structure, but I failed to find the exact code spot were >> things get messed up. >> >> I've put a some effort on this code lately, but I definitely need some >> assistance on this one. Since, as far as I understand, this issue is >> much more likely to reproduce on boxes with a lot of hotpluggable >> processors, I'm willing to test any patches you may find suitable in our >> systems. >> > > Hi, > > [Adding linux-block, linux-scsi and Jens in CC for blk-mq input] > > Made some progress on tracking down the issue. It's looking more of a > blk-mq bug than nvme to me now. Let me know if I made any wrong > assumption below: > > I started to wonder whether blk-mq is requesting the IO in a wrong queue > or if the request is being moved to another queue during hotplug, as it > would happen in blk_mq_hctx_cpu_offline. > > I verified that when the fail occurs, the request never was never > touched in blk_mq_hctx_cpu_offline. Also, and more interesting, I added > the following BUG_ON to nvme_queue_req: > > BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req); > > From my experiments, I could never hit this during normal operation > (without SMT). But, as soon as I start triggering hotplug changes, I hit > it pretty fast. By changing to WARN_ON, I see that the Oops will happen > almost immediately after this request is submitted, once __process_cq > tries to complete it. > > Also, when adding the patch below, we send a few -EIOs back to upper > layers, but it prevents the crash. > > I also think i saw a similar hang when running on a virtio_scsi disk. I > couldn't reproduce it a second time, so we still lack confirmation, but > if confirmed, it indeed points away from nvme and back to the block > layer. > > Any suggestions to further track this down? Some good detective work so far! I agree, this looks like a blk-mq core bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when req->tag != nvmeq->tags? 
I don't immediately see how this could happen, the freezing should protect us from this, unless it's broken somehow. I'll take a look at this. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
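
The "freezing" referred to here is the queue freeze that the blk-mq CPU-hotplug notifier performs around every remap of software contexts to hardware contexts. Schematically — a condensed sketch of the 4.7-era flow inside blk_mq_queue_reinit_notify(), not the literal code, with locking and the online-mask bookkeeping omitted:

	/*
	 * Hypothetical helper name for illustration; in the real kernel this
	 * sequence lives inside blk_mq_queue_reinit_notify() in block/blk-mq.c.
	 */
	static void reinit_all_queues_sketch(const struct cpumask *online_new)
	{
		struct request_queue *q;

		/* 1) Stop new I/O: kill q_usage_counter and wait for it to drain. */
		list_for_each_entry(q, &all_q_list, all_q_node)
			blk_mq_freeze_queue_start(q);
		list_for_each_entry(q, &all_q_list, all_q_node)
			blk_mq_freeze_queue_wait(q);

		/* 2) Rebuild the ctx -> hctx mapping against the new CPU mask. */
		list_for_each_entry(q, &all_q_list, all_q_node)
			blk_mq_queue_reinit(q, online_new);

		/* 3) Let I/O flow again. */
		list_for_each_entry(q, &all_q_list, all_q_node)
			blk_mq_unfreeze_queue(q);
	}

In principle no request should be in flight while the map changes, which is why a request apparently crossing queues points either at a hole in this protection or at driver state that the freeze does not cover.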
* Re: Oops when completing request on the wrong queue 2016-08-19 14:13 ` Jens Axboe @ 2016-08-19 15:51 ` Jens Axboe 2016-08-19 16:38 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2016-08-19 15:51 UTC (permalink / raw) To: Gabriel Krisman Bertazi, Keith Busch Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/19/2016 08:13 AM, Jens Axboe wrote: > On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote: >> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> >>> We, IBM, have been experiencing eventual Oops when stressing IO at the >>> same time we add/remove processors. The Oops happens in the IRQ path, >>> when we try to complete a request that was apparently meant for another >>> queue. >>> >>> In __nvme_process_cq, the driver will use the cqe.command_id and the >>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that >>> initiated the IO. Eventually, it happens that the request returned by >>> that function is not initialized, and we crash inside >>> __blk_mq_complete_request, as shown below. >>> >>> [ 2679.050701] Faulting instruction address: 0xc000000000549500 >>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1] >>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries >>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c >>> rpcsec_gss_krb5 auth_rpcgss >>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc >>> autofs4 btrfs xor >>> raid6_pq ibmvscsi ibmveth nvme >>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted >>> 4.4.0-22-generic #40-Ubuntu >>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: >>> c000000451bc0000 >>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: >>> c000000000549680 >>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300 Not tainted >>> (4.4.0-22-generic) >>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: >>> 28022428 XER: 00000000 >>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: >>> 40000000 SOFTE: 0 >>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 >>> c0000002799a0a00 >>> GPR04: 0000000000000000 0000000000000105 0000000000000004 >>> 0000000000000001 >>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 >>> d0000000046bdc68 >>> GPR12: c000000000549680 c000000007b1c780 0000000000000008 >>> 0000000000000001 >>> GPR16: 0000000000000000 0000000000000000 0000000000000000 >>> 0000000010005b50 >>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 >>> 0000000000000002 >>> GPR24: 00000000000001e3 0000000000000000 0000000000000000 >>> ffffffffffffffff >>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 >>> c0000002799a0a00 >>> [ 2679.050908] NIP [c000000000549500] >>> __blk_mq_complete_request+0x30/0x1b0 >>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme] >>> [ 2679.050930] Call Trace: >>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] >>> 0xc0000004fff2fd40 (unreliable) >>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] >>> __nvme_process_cq+0xf4/0x2c0 [nvme] >>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] >>> nvme_irq+0x3c/0xb0 [nvme] >>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] >>> handle_irq_event_percpu+0xa0/0x320 >>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] >>> handle_irq_event+0x68/0xc0 >>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] >>> handle_fasteoi_irq+0xec/0x2b0 >>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] >>> 
generic_handle_irq+0x54/0x80 >>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0 >>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] >>> call_do_irq+0x14/0x24 >>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140 >>> [ 2679.051054] [c000000451bc33a0] [c000000000002594] >>> hardware_interrupt_common+0x114/0x180 >>> >>> I added some debugging which skipped the Oops and made me think that we >>> are fetching the wrong request because we are actually looking at the >>> wrong set of tags. >>> >>> The log below exemplifies what I am saying. Instead of crashing in the >>> Oops, I made the execution skip the request completion and simply >>> consider that odd CQE as handled. The first line of the log does that. >>> >>>> nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q >>>> nvme nvme0: I/O 309 QID 63 timeout, aborting >>>> nvme nvme0: Abort status: 0x0 >>>> nvme nvme0: I/O 309 QID 63 timeout, reset controller >>>> nvme nvme0: completing aborted command with status: fffffffc >>>> nvme 0003:03:00.0: Using 64-bit DMA iommu bypass >>>> blk_update_request: I/O error, dev nvme0n1, sector 2105537536 >>> >>> In the first line, we found the request 309 for queue 35, which would >>> have crashed the execution. My debug patch just ignores it. >>> >>> Then, we can see that eventually, the same IO, 309, will timeout in >>> another QID, which was actually expecting for it. The Abort request >>> gets sent and completed, but we never receive the completion of the >>> aborted request, thus we timeout again and restart the device. >>> >>> This only happens when we are changing the SMT settings very >>> frequently, which points back to the way we correlate the hctx->tags to >>> the nvmeq structure, but I failed to find the exact code spot were >>> things get messed up. >>> >>> I've put a some effort on this code lately, but I definitely need some >>> assistance on this one. Since, as far as I understand, this issue is >>> much more likely to reproduce on boxes with a lot of hotpluggable >>> processors, I'm willing to test any patches you may find suitable in our >>> systems. >>> >> >> Hi, >> >> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input] >> >> Made some progress on tracking down the issue. It's looking more of a >> blk-mq bug than nvme to me now. Let me know if I made any wrong >> assumption below: >> >> I started to wonder whether blk-mq is requesting the IO in a wrong queue >> or if the request is being moved to another queue during hotplug, as it >> would happen in blk_mq_hctx_cpu_offline. >> >> I verified that when the fail occurs, the request never was never >> touched in blk_mq_hctx_cpu_offline. Also, and more interesting, I added >> the following BUG_ON to nvme_queue_req: >> >> BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req); >> >> From my experiments, I could never hit this during normal operation >> (without SMT). But, as soon as I start triggering hotplug changes, I hit >> it pretty fast. By changing to WARN_ON, I see that the Oops will happen >> almost immediately after this request is submitted, once __process_cq >> tries to complete it. >> >> Also, when adding the patch below, we send a few -EIOs back to upper >> layers, but it prevents the crash. >> >> I also think i saw a similar hang when running on a virtio_scsi disk. I >> couldn't reproduce it a second time, so we still lack confirmation, but >> if confirmed, it indeed points away from nvme and back to the block >> layer. 
>> >> Any suggestions to further track this down? > > Some good detective work so far! I agree, this looks like a blk-mq core > bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when > req->tag != nvmeq->tags? I don't immediately see how this could happen, > the freezing should protect us from this, unless it's broken somehow. > I'll take a look at this. Can you share what you ran to online/offline CPUs? I can't reproduce this here. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-19 15:51 ` Jens Axboe @ 2016-08-19 16:38 ` Gabriel Krisman Bertazi 2016-08-23 20:54 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 12+ messages in thread From: Gabriel Krisman Bertazi @ 2016-08-19 16:38 UTC (permalink / raw) To: Jens Axboe Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi Jens Axboe <axboe@kernel.dk> writes: >> Some good detective work so far! I agree, this looks like a blk-mq core >> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when >> req->tag != nvmeq->tags? I don't immediately see how this could happen, >> the freezing should protect us from this, unless it's broken somehow. >> I'll take a look at this. Nothing much on the BUG() stack trace, at least to me. We are coming from an async execution of __blk_mq_run_hw_queue: This is from 4.7. I'll give another run on 4.8-rc1. [ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602! cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0] pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme] lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme] sp: c000007870d3ba20 msr: 900000010282b033 current = 0xc000007870d9e400 paca = 0xc000000002bd8800 softe: 0 irq_happened: 0x01 pid = 9205, comm = kworker/72:1H kernel BUG at ../drivers/nvme/host/pci.c:602! Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2. 1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016 enter ? for help [c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640 [c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590 [c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660 [c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130 [c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c 48:mon> Another interesting data point is that we always reach the WARN_ON in __blk_mq_run_hw_queue immediately before submitting the bad request. Maybe we are touching some percpu variable from the wrong cpu? See the WARN_ON trace below. 
[ 369.953825] ------------[ cut here ]------------ [ 369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x6 40 [ 369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_ misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 ra id456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng [ 369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150 [ 369.955085] Workqueue: kblockd blk_mq_run_work_fn [ 369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000 [ 369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000 [ 369.955314] REGS: c0000078cb027890 TRAP: 0700 Not tainted (4.7.0.mainline+) [ 369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28002224 XER: 20000000 [ 369.955579] CFAR: c000000000569a18 SOFTE: 1 GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008 GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001 GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003 GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000 GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800 GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88 GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700 [ 369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640 [ 369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640 [ 369.956433] Call Trace: [ 369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable) [ 369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590 [ 369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660 [ 369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130 [ 369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c [ 369.956791] Instruction dump: [ 369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a [ 369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060 [ 369.957020] ---[ end trace 1398dd60e3057065 ]--- [ 369.957094] ------------[ cut here ]------------ > Can you share what you ran to online/offline CPUs? I can't reproduce > this here. I was using the ppc64_cpu tool, which shouldn't do nothing more than write to sysfs. but I just reproduced it with the script below. Note that this is ppc64le. I don't have a x86 in hand to attempt to reproduce right now, but I'll look for one and see how it goes. #!/bin/bash MAXCPUs=159 STATE=1 while true; do for i in $(seq 1 ${MAXCPUS}); do if (($i%8)); then echo $STATE > /sys/devices/system/cpu/cpu$i/online fi done if [[ $STATE -eq 1 ]]; then STATE=0 else STATE=1 fi done -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 12+ messages in thread
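
For reference, the WARN_ON at block/blk-mq.c:840 that fires in the trace above is the sanity check at the top of __blk_mq_run_hw_queue() (it also appears as context in Jens's patches further down):

	/* blk-mq.c:840 in this kernel: the hardware queue is being run from a
	 * CPU that is not in its cpumask. */
	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

So the submission-side symptom mirrors the completion-side one: work for a hardware context is executing on a CPU that, after the remap, no longer belongs to it.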
* Re: Oops when completing request on the wrong queue 2016-08-19 16:38 ` Gabriel Krisman Bertazi @ 2016-08-23 20:54 ` Gabriel Krisman Bertazi 2016-08-23 21:11 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Gabriel Krisman Bertazi @ 2016-08-23 20:54 UTC (permalink / raw) To: Jens Axboe Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> Can you share what you ran to online/offline CPUs? I can't reproduce >> this here. > > I was using the ppc64_cpu tool, which shouldn't do nothing more than > write to sysfs. but I just reproduced it with the script below. > > Note that this is ppc64le. I don't have a x86 in hand to attempt to > reproduce right now, but I'll look for one and see how it goes. Hi, Any luck on reproducing it? We were initially reproducing with a proprietary stress test, but I gave a try to a generated fio jobfile associated with the SMT script I shared earlier and I could reproduce the crash consistently in less than 10 minutes of execution. this was still ppc64le, though. I couldn't get my hands on nvme on x86 yet. The job file I used, as well as the smt.sh script, in case you want to give it a try: jobfile: http://krisman.be/k/nvmejob.fio smt.sh: http://krisman.be/k/smt.sh Still, the trigger seems to be consistently a heavy load of IO associated with CPU addition/removal. Let me share my progress from the last couple days in the hope that it rings a bell for you. Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the request_queue's freeze_depth is 0, which points away from a fault in the freeze/unfreeze mechanism. If a request was escaping and going through the block layer during a freeze, we'd see freeze_depth >= 1. Before that, I had also tried to keep the q_usage_counter in atomic mode, in case of a bug in the percpu refcount. No luck, the BUG_ON was still hit. Also, I don't see anything special about the request that reaches the BUG_ON. It's a REQ_TYPE_FS request and, at least in the last time I reproduced, it was a READ that came from the stress test task through submit_bio. So nothing remarkable about it too, as far as I can see. I'm still thinking about a case in which the mapping get's screwed up, where a ctx would appear into two hctxs bitmaps after a remap, or if the ctx got remaped to another hctx. I'm still learning my way through the cpumap code, so I'm not sure it's a real possibility, but I'm not convinced it isn't. Some preliminary tests don't suggest it's the case at play, but I wanna spend a little more time on this theory (maybe for my lack of better ideas :) On a side note, probably unrelated to this crash, it also got me thinking about the current usefulness of blk_mq_hctx_notify. Since CPU is dead, no more requests would be coming through its ctx. I think we could force a queue run in blk_mq_queue_reinit_notify, before remapping, which would cause the hctx to fetch the remaining requests from that dead ctx (since it's not unmapped yet). This way, we could maintain a single hotplug notification hook and simplify the hotplug path. I haven't written code for it yet, but I'll see if I can come up with something and send to the list. -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 12+ messages in thread
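
The freeze_depth check described above amounts to instrumentation of roughly this shape at the BUG_ON site — a sketch only, assuming the 4.7-era atomic_t mq_freeze_depth field in struct request_queue:

	/*
	 * Sketch of the verification described above, not the exact debug
	 * patch: report the queue's freeze depth at the moment the stale
	 * tag-to-request mapping is detected in nvme_queue_rq().
	 */
	if (blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
		struct request_queue *q = req->q;

		pr_err("nvme: tag %d does not map back to its request, freeze_depth=%d\n",
		       req->tag, atomic_read(&q->mq_freeze_depth));
		return BLK_MQ_RQ_QUEUE_ERROR;
	}

A freeze depth of zero at that point means no freeze was in progress, i.e. the stray request was not racing with the hotplug freeze itself.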
* Re: Oops when completing request on the wrong queue 2016-08-23 20:54 ` Gabriel Krisman Bertazi @ 2016-08-23 21:11 ` Jens Axboe 2016-08-23 21:14 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2016-08-23 21:11 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: > Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: > >>> Can you share what you ran to online/offline CPUs? I can't reproduce >>> this here. >> >> I was using the ppc64_cpu tool, which shouldn't do nothing more than >> write to sysfs. but I just reproduced it with the script below. >> >> Note that this is ppc64le. I don't have a x86 in hand to attempt to >> reproduce right now, but I'll look for one and see how it goes. > > Hi, > > Any luck on reproducing it? We were initially reproducing with a > proprietary stress test, but I gave a try to a generated fio jobfile > associated with the SMT script I shared earlier and I could reproduce > the crash consistently in less than 10 minutes of execution. this was > still ppc64le, though. I couldn't get my hands on nvme on x86 yet. Nope, I have not been able to reproduce it. How long does the CPU offline/online actions take on ppc64? It's pretty slow on x86, which may hide the issue. I took out the various printk's associated with bringing a CPU off/online, as well as IRQ breaking parts, but didn't help in reproducing it. > The job file I used, as well as the smt.sh script, in case you want to > give it a try: > > jobfile: http://krisman.be/k/nvmejob.fio > smt.sh: http://krisman.be/k/smt.sh > > Still, the trigger seems to be consistently a heavy load of IO > associated with CPU addition/removal. My workload looks similar to yours, in that it's high depth and with a lot of jobs to keep most CPUs loaded. My bash script is different than yours, I'll try that and see if it helps here. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-23 21:11 ` Jens Axboe @ 2016-08-23 21:14 ` Jens Axboe 2016-08-23 22:49 ` Keith Busch 2016-08-24 18:34 ` Jens Axboe 0 siblings, 2 replies; 12+ messages in thread From: Jens Axboe @ 2016-08-23 21:14 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/23/2016 03:11 PM, Jens Axboe wrote: > On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: >> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >> >>>> Can you share what you ran to online/offline CPUs? I can't reproduce >>>> this here. >>> >>> I was using the ppc64_cpu tool, which shouldn't do nothing more than >>> write to sysfs. but I just reproduced it with the script below. >>> >>> Note that this is ppc64le. I don't have a x86 in hand to attempt to >>> reproduce right now, but I'll look for one and see how it goes. >> >> Hi, >> >> Any luck on reproducing it? We were initially reproducing with a >> proprietary stress test, but I gave a try to a generated fio jobfile >> associated with the SMT script I shared earlier and I could reproduce >> the crash consistently in less than 10 minutes of execution. this was >> still ppc64le, though. I couldn't get my hands on nvme on x86 yet. > > Nope, I have not been able to reproduce it. How long does the CPU > offline/online actions take on ppc64? It's pretty slow on x86, which may > hide the issue. I took out the various printk's associated with bringing > a CPU off/online, as well as IRQ breaking parts, but didn't help in > reproducing it. > >> The job file I used, as well as the smt.sh script, in case you want to >> give it a try: >> >> jobfile: http://krisman.be/k/nvmejob.fio >> smt.sh: http://krisman.be/k/smt.sh >> >> Still, the trigger seems to be consistently a heavy load of IO >> associated with CPU addition/removal. > > My workload looks similar to yours, in that it's high depth and with a > lot of jobs to keep most CPUs loaded. My bash script is different than > yours, I'll try that and see if it helps here. Actually, I take that back. You're not using O_DIRECT, hence all your jobs are running at QD=1, not the 256 specified. That looks odd, but I'll try, maybe it'll hit something different. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-23 21:14 ` Jens Axboe @ 2016-08-23 22:49 ` Keith Busch 2016-08-24 18:34 ` Jens Axboe 1 sibling, 0 replies; 12+ messages in thread From: Keith Busch @ 2016-08-23 22:49 UTC (permalink / raw) To: Jens Axboe Cc: Gabriel Krisman Bertazi, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote: > On 08/23/2016 03:11 PM, Jens Axboe wrote: > >My workload looks similar to yours, in that it's high depth and with a > >lot of jobs to keep most CPUs loaded. My bash script is different than > >yours, I'll try that and see if it helps here. > > Actually, I take that back. You're not using O_DIRECT, hence all your > jobs are running at QD=1, not the 256 specified. That looks odd, but > I'll try, maybe it'll hit something different. I haven't recreated this either, but I think I can logically see why this failure is happening. I sent an nvme driver patch earlier on this thread to exit the hardware context, which I thought would do the trick if the hctx's tags were being moved. That turns out to be wrong for a couple reasons. First, we can't release the nvmeq->tags when a hctx exits because that nvmeq may be used by other namespaces that need to point to the device's tag set. The other reason is that blk-mq doesn't exit or init hardware contexts when remapping for a CPU event, leaving the nvme driver unaware a hardware context points to a different tag set. So I think I see why this test would fail; don't know about a fix yet. Maybe the nvme driver needs some indirection instead of pointing directly to the tagset after init_hctx. ^ permalink raw reply [flat|nested] 12+ messages in thread
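
To make the point about indirection concrete: the driver captures a pointer into the shared tag set the first time a hardware context is initialised, and the interrupt path trusts that cached pointer from then on. A simplified sketch of the 4.7-era driver code — abridged from memory, not a verbatim copy of drivers/nvme/host/pci.c:

	static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
				  unsigned int hctx_idx)
	{
		struct nvme_dev *dev = data;
		struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

		/*
		 * Cache a pointer into the device's tag set the first time this
		 * hardware context is set up.  blk-mq does not call
		 * init_hctx/exit_hctx again when a hotplug event remaps software
		 * contexts, so this association is never revisited.
		 */
		if (!nvmeq->tags)
			nvmeq->tags = &dev->tagset.tags[hctx_idx];

		hctx->driver_data = nvmeq;
		return 0;
	}

	/*
	 * The interrupt path then resolves completions purely through that
	 * cached pointer, e.g. in __nvme_process_cq():
	 *
	 *	req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
	 *
	 * If the command was actually tagged against a different hardware
	 * queue's tag set, this lookup silently returns the wrong request.
	 */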
* Re: Oops when completing request on the wrong queue 2016-08-23 21:14 ` Jens Axboe 2016-08-23 22:49 ` Keith Busch @ 2016-08-24 18:34 ` Jens Axboe 2016-08-24 20:36 ` Jens Axboe 1 sibling, 1 reply; 12+ messages in thread From: Jens Axboe @ 2016-08-24 18:34 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/23/2016 03:14 PM, Jens Axboe wrote: > On 08/23/2016 03:11 PM, Jens Axboe wrote: >> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: >>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >>> >>>>> Can you share what you ran to online/offline CPUs? I can't reproduce >>>>> this here. >>>> >>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than >>>> write to sysfs. but I just reproduced it with the script below. >>>> >>>> Note that this is ppc64le. I don't have a x86 in hand to attempt to >>>> reproduce right now, but I'll look for one and see how it goes. >>> >>> Hi, >>> >>> Any luck on reproducing it? We were initially reproducing with a >>> proprietary stress test, but I gave a try to a generated fio jobfile >>> associated with the SMT script I shared earlier and I could reproduce >>> the crash consistently in less than 10 minutes of execution. this was >>> still ppc64le, though. I couldn't get my hands on nvme on x86 yet. >> >> Nope, I have not been able to reproduce it. How long does the CPU >> offline/online actions take on ppc64? It's pretty slow on x86, which may >> hide the issue. I took out the various printk's associated with bringing >> a CPU off/online, as well as IRQ breaking parts, but didn't help in >> reproducing it. >> >>> The job file I used, as well as the smt.sh script, in case you want to >>> give it a try: >>> >>> jobfile: http://krisman.be/k/nvmejob.fio >>> smt.sh: http://krisman.be/k/smt.sh >>> >>> Still, the trigger seems to be consistently a heavy load of IO >>> associated with CPU addition/removal. >> >> My workload looks similar to yours, in that it's high depth and with a >> lot of jobs to keep most CPUs loaded. My bash script is different than >> yours, I'll try that and see if it helps here. > > Actually, I take that back. You're not using O_DIRECT, hence all your > jobs are running at QD=1, not the 256 specified. That looks odd, but > I'll try, maybe it'll hit something different. Can you try this patch? It's not perfect, but I'll be interested if it makes a difference for you. 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..41def54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	struct list_head *dptr;
 	int queued;
 
-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
+
 	hctx->run++;
 
 	/*
@@ -1075,15 +1075,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		bool async)
+			   bool async)
 {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1093,8 +1089,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 
 	if (run_queue)
 		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
 }
 
 static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1099,9 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	/*
@@ -1125,14 +1114,12 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
 		__blk_mq_insert_req_list(hctx, ctx, rq, false);
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 
 	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1679,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	while (!list_empty(&tmp)) {
 		struct request *rq;
 
+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
 		rq = list_first_entry(&tmp, struct request, queuelist);
 		rq->mq_ctx = ctx;
 		list_move_tail(&rq->queuelist, &ctx->rq_list);

--
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-24 18:34 ` Jens Axboe @ 2016-08-24 20:36 ` Jens Axboe 2016-08-29 18:06 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2016-08-24 20:36 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/24/2016 12:34 PM, Jens Axboe wrote: > On 08/23/2016 03:14 PM, Jens Axboe wrote: >> On 08/23/2016 03:11 PM, Jens Axboe wrote: >>> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote: >>>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes: >>>> >>>>>> Can you share what you ran to online/offline CPUs? I can't reproduce >>>>>> this here. >>>>> >>>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than >>>>> write to sysfs. but I just reproduced it with the script below. >>>>> >>>>> Note that this is ppc64le. I don't have a x86 in hand to attempt to >>>>> reproduce right now, but I'll look for one and see how it goes. >>>> >>>> Hi, >>>> >>>> Any luck on reproducing it? We were initially reproducing with a >>>> proprietary stress test, but I gave a try to a generated fio jobfile >>>> associated with the SMT script I shared earlier and I could reproduce >>>> the crash consistently in less than 10 minutes of execution. this was >>>> still ppc64le, though. I couldn't get my hands on nvme on x86 yet. >>> >>> Nope, I have not been able to reproduce it. How long does the CPU >>> offline/online actions take on ppc64? It's pretty slow on x86, which may >>> hide the issue. I took out the various printk's associated with bringing >>> a CPU off/online, as well as IRQ breaking parts, but didn't help in >>> reproducing it. >>> >>>> The job file I used, as well as the smt.sh script, in case you want to >>>> give it a try: >>>> >>>> jobfile: http://krisman.be/k/nvmejob.fio >>>> smt.sh: http://krisman.be/k/smt.sh >>>> >>>> Still, the trigger seems to be consistently a heavy load of IO >>>> associated with CPU addition/removal. >>> >>> My workload looks similar to yours, in that it's high depth and with a >>> lot of jobs to keep most CPUs loaded. My bash script is different than >>> yours, I'll try that and see if it helps here. >> >> Actually, I take that back. You're not using O_DIRECT, hence all your >> jobs are running at QD=1, not the 256 specified. That looks odd, but >> I'll try, maybe it'll hit something different. > > Can you try this patch? It's not perfect, but I'll be interested if it > makes a difference for you. This one should handle the WARN_ON() for running the hw queue on the wrong CPU as well. 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..b21a9b9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,12 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	struct list_head *dptr;
 	int queued;
 
-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+		cpu_online(hctx->next_cpu));
+
 	hctx->run++;
 
 	/*
@@ -1075,15 +1082,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		bool async)
+			   bool async)
 {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1093,8 +1096,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 
 	if (run_queue)
 		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
 }
 
 static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1106,9 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	/*
@@ -1125,14 +1121,12 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
 		__blk_mq_insert_req_list(hctx, ctx, rq, false);
 	}
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 
 	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
 }
 
 static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1686,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	while (!list_empty(&tmp)) {
 		struct request *rq;
 
+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
 		rq = list_first_entry(&tmp, struct request, queuelist);
 		rq->mq_ctx = ctx;
 		list_move_tail(&rq->queuelist, &ctx->rq_list);

--
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-24 20:36 ` Jens Axboe @ 2016-08-29 18:06 ` Gabriel Krisman Bertazi 2016-08-29 18:40 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Gabriel Krisman Bertazi @ 2016-08-29 18:06 UTC (permalink / raw) To: Jens Axboe Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi Jens Axboe <axboe@kernel.dk> writes: >> Can you try this patch? It's not perfect, but I'll be interested if it >> makes a difference for you. > Hi Jens, Sorry for the delay. I just got back to this and have been running your patch on top of 4.8 without a crash for over 1 hour. I wanna give it more time to make sure it's running properly, though. Let me get back to you after a few more rounds of test. > This one should handle the WARN_ON() for running the hw queue on the > wrong CPU as well. On the workaround you added to prevent WARN_ON, we surely need to prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first place, right.. How do you feel about the following RFC? I know it's not a complete fix, but it feels like a good improvement to me. http://www.spinics.net/lists/linux-scsi/msg98608.html -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Oops when completing request on the wrong queue 2016-08-29 18:06 ` Gabriel Krisman Bertazi @ 2016-08-29 18:40 ` Jens Axboe 0 siblings, 0 replies; 12+ messages in thread From: Jens Axboe @ 2016-08-29 18:40 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote: > Jens Axboe <axboe@kernel.dk> writes: >>> Can you try this patch? It's not perfect, but I'll be interested if it >>> makes a difference for you. >> > > Hi Jens, > > Sorry for the delay. I just got back to this and have been running your > patch on top of 4.8 without a crash for over 1 hour. I wanna give it > more time to make sure it's running properly, though. > > Let me get back to you after a few more rounds of test. Thanks, sounds good. The patches have landed in mainline too. >> This one should handle the WARN_ON() for running the hw queue on the >> wrong CPU as well. > > On the workaround you added to prevent WARN_ON, we surely need to > prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first > place, right.. How do you feel about the following RFC? I know it's > not a complete fix, but it feels like a good improvement to me. > > http://www.spinics.net/lists/linux-scsi/msg98608.html But we can't completely prevent it, and I don't think we have to. I just don't want to trigger a warning for something that's a valid condition. I want the warning to trigger if this happens without the CPU going offline, since then it's indicative of a real bug in the mapping. Your patch isn't going to prevent it either - it'll shrink the window, at the expense of making blk_mq_hctx_next_cpu() more expensive. So I don't think it's worthwhile. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
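
For context on that last point: the CPU selection being discussed is, roughly per the 4.7-era block/blk-mq.c (reproduced here from memory and slightly simplified), a batched round-robin over the hardware context's CPU mask. A CPU picked here can still go offline before the queued work actually runs, which is why an online check inside it narrows the race but cannot close it:

	static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
	{
		if (hctx->queue->nr_hw_queues == 1)
			return WORK_CPU_UNBOUND;

		/* Advance to the next CPU in the mask every few runs. */
		if (--hctx->next_cpu_batch <= 0) {
			int cpu = hctx->next_cpu, next_cpu;

			next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
			if (next_cpu >= nr_cpu_ids)
				next_cpu = cpumask_first(hctx->cpumask);

			hctx->next_cpu = next_cpu;
			hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;

			return cpu;
		}

		return hctx->next_cpu;
	}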