All of lore.kernel.org
 help / color / mirror / Atom feed
From: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	linux-nvme@lists.infradead.org,
	Brian King <brking@linux.vnet.ibm.com>,
	linux-block@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: Oops when completing request on the wrong queue
Date: Tue, 23 Aug 2016 17:54:03 -0300	[thread overview]
Message-ID: <87a8g39pg4.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <87shu0hfye.fsf@linux.vnet.ibm.com> (Gabriel Krisman Bertazi's message of "Fri, 19 Aug 2016 13:38:17 -0300")

Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do nothing more than
> write to sysfs.  but I just reproduced it with the script below.
>
> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it?  We were initially reproducing with a
proprietary stress test, but I gave a try to a generated fio jobfile
associated with the SMT script I shared earlier and I could reproduce
the crash consistently in less than 10 minutes of execution.  this was
still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple days in the hope that it
rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism.  If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1.  Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount.  No luck, the BUG_ON was still
hit.

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least in the last time I
reproduced, it was a READ that came from the stress test task through
submit_bio.  So nothing remarkable about it too, as far as I can see.

I'm still thinking about a case in which the mapping get's screwed up,
where a ctx would appear into two hctxs bitmaps after a remap, or if the
ctx got remaped to another hctx.  I'm still learning my way through the
cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't.  Some preliminary tests don't suggest it's the case
at play, but I wanna spend a little more time on this theory (maybe
for my lack of better ideas :)

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since CPU
is dead, no more requests would be coming through its ctx.  I think we
could force a queue run in blk_mq_queue_reinit_notify, before remapping,
which would cause the hctx to fetch the remaining requests from that
dead ctx (since it's not unmapped yet).  This way, we could maintain a
single hotplug notification hook and simplify the hotplug path.  I
haven't written code for it yet, but I'll see if I can come up with
something and send to the list.

-- 
Gabriel Krisman Bertazi

WARNING: multiple messages have this Message-ID (diff)
From: krisman@linux.vnet.ibm.com (Gabriel Krisman Bertazi)
Subject: Oops when completing request on the wrong queue
Date: Tue, 23 Aug 2016 17:54:03 -0300	[thread overview]
Message-ID: <87a8g39pg4.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <87shu0hfye.fsf@linux.vnet.ibm.com> (Gabriel Krisman Bertazi's message of "Fri, 19 Aug 2016 13:38:17 -0300")

Gabriel Krisman Bertazi <krisman at linux.vnet.ibm.com> writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do nothing more than
> write to sysfs.  but I just reproduced it with the script below.
>
> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it?  We were initially reproducing with a
proprietary stress test, but I gave a try to a generated fio jobfile
associated with the SMT script I shared earlier and I could reproduce
the crash consistently in less than 10 minutes of execution.  this was
still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple days in the hope that it
rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism.  If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1.  Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount.  No luck, the BUG_ON was still
hit.

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least in the last time I
reproduced, it was a READ that came from the stress test task through
submit_bio.  So nothing remarkable about it too, as far as I can see.

I'm still thinking about a case in which the mapping get's screwed up,
where a ctx would appear into two hctxs bitmaps after a remap, or if the
ctx got remaped to another hctx.  I'm still learning my way through the
cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't.  Some preliminary tests don't suggest it's the case
at play, but I wanna spend a little more time on this theory (maybe
for my lack of better ideas :)

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since CPU
is dead, no more requests would be coming through its ctx.  I think we
could force a queue run in blk_mq_queue_reinit_notify, before remapping,
which would cause the hctx to fetch the remaining requests from that
dead ctx (since it's not unmapped yet).  This way, we could maintain a
single hotplug notification hook and simplify the hotplug path.  I
haven't written code for it yet, but I'll see if I can come up with
something and send to the list.

-- 
Gabriel Krisman Bertazi

  reply	other threads:[~2016-08-23 20:54 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-10  4:04 Oops when completing request on the wrong queue Gabriel Krisman Bertazi
2016-08-11 17:16 ` Keith Busch
2016-08-11 18:10   ` Gabriel Krisman Bertazi
2016-08-19 13:28 ` Gabriel Krisman Bertazi
2016-08-19 13:28   ` Gabriel Krisman Bertazi
2016-08-19 14:13   ` Jens Axboe
2016-08-19 14:13     ` Jens Axboe
2016-08-19 15:51     ` Jens Axboe
2016-08-19 15:51       ` Jens Axboe
2016-08-19 16:38       ` Gabriel Krisman Bertazi
2016-08-19 16:38         ` Gabriel Krisman Bertazi
2016-08-23 20:54         ` Gabriel Krisman Bertazi [this message]
2016-08-23 20:54           ` Gabriel Krisman Bertazi
2016-08-23 21:11           ` Jens Axboe
2016-08-23 21:11             ` Jens Axboe
2016-08-23 21:14             ` Jens Axboe
2016-08-23 21:14               ` Jens Axboe
2016-08-23 22:49               ` Keith Busch
2016-08-23 22:49                 ` Keith Busch
2016-08-24 18:34               ` Jens Axboe
2016-08-24 18:34                 ` Jens Axboe
2016-08-24 20:36                 ` Jens Axboe
2016-08-24 20:36                   ` Jens Axboe
2016-08-29 18:06                   ` Gabriel Krisman Bertazi
2016-08-29 18:06                     ` Gabriel Krisman Bertazi
2016-08-29 18:40                     ` Jens Axboe
2016-08-29 18:40                       ` Jens Axboe
2016-09-05 12:02                       ` Gabriel Krisman Bertazi
2016-09-05 12:02                         ` Gabriel Krisman Bertazi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a8g39pg4.fsf@linux.vnet.ibm.com \
    --to=krisman@linux.vnet.ibm.com \
    --cc=axboe@kernel.dk \
    --cc=brking@linux.vnet.ibm.com \
    --cc=hch@lst.de \
    --cc=keith.busch@intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-scsi@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.