* [PATCH 0/2] blk-mq: two patches related with CPU hotplug
@ 2018-01-17 12:34 Ming Lei
2018-01-17 12:34 ` [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly Ming Lei
2018-01-17 12:34 ` [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: Ming Lei @ 2018-01-17 12:34 UTC (permalink / raw)
To: Jens Axboe, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Christian Borntraeger,
Ming Lei
Hi Jens,
The 1st patch fixes a regression caused by commit 20e4d81393 ("blk-mq: simplify
queue mapping & schedule with each possisble CPU").
The 2nd patch adds comments about the current races, and converts the WARN_ON
into printk(KERN_WARNING) since this warning is harmless.
Thanks,
Ming Lei (2):
blk-mq: make sure hctx->next_cpu is set correctly
blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
block/blk-mq.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
--
2.9.5
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly
2018-01-17 12:34 [PATCH 0/2] blk-mq: two patches related with CPU hotplug Ming Lei
@ 2018-01-17 12:34 ` Ming Lei
2018-01-17 15:45 ` Jens Axboe
2018-01-17 12:34 ` [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2018-01-17 12:34 UTC (permalink / raw)
To: Jens Axboe, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Christian Borntraeger,
Ming Lei, Stefan Haberland, Christoph Hellwig
When hctx->next_cpu is set from the online CPUs in hctx->cpumask, there
is a race in which hctx->next_cpu may be set to a value >= nr_cpu_ids,
which in turn breaks the workqueue.
The race is triggered when one CPU is going DEAD: blk_mq_hctx_notify_dead()
is called to dispatch requests from the DEAD CPU's context, but at that
point the DEAD CPU has already been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may be offline, causing hctx->next_cpu to be set to
a bad value.
This patch deals with the issue by re-selecting the next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c376d1b6309a..dc4066d28323 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1416,21 +1416,48 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
*/
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
+ bool tried = false;
+
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
-
+select_cpu:
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
- hctx->next_cpu = next_cpu;
+ /*
+ * No online CPU is found here. This may happen when running
+ * from blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
+ * is set correctly to avoid breaking the workqueue.
+ */
+ if (next_cpu >= nr_cpu_ids)
+ hctx->next_cpu = cpumask_first(hctx->cpumask);
+ else
+ hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
+ /*
+ * Do an unbound schedule if we can't find an online CPU for this
+ * hctx, which should only happen in the CPU DEAD handling path.
+ */
+ if (!cpu_online(hctx->next_cpu)) {
+ if (!tried) {
+ tried = true;
+ goto select_cpu;
+ }
+
+ /*
+ * Make sure to re-select the CPU next time once CPUs in
+ * hctx->cpumask become online again.
+ */
+ hctx->next_cpu_batch = 1;
+ return WORK_CPU_UNBOUND;
+ }
return hctx->next_cpu;
}
--
2.9.5
* [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
2018-01-17 12:34 [PATCH 0/2] blk-mq: two patches related with CPU hotplug Ming Lei
2018-01-17 12:34 ` [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly Ming Lei
@ 2018-01-17 12:34 ` Ming Lei
2018-01-17 15:46 ` Jens Axboe
1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2018-01-17 12:34 UTC (permalink / raw)
To: Jens Axboe, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Christian Borntraeger,
Ming Lei, Stefan Haberland, Christoph Hellwig
We know this WARN_ON is harmless, and the stack trace isn't useful either,
so convert it to printk() to avoid confusing people.
Also add a comment about the two related races here.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index dc4066d28323..6562360bf108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1391,9 +1391,25 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
/*
* We should be running this queue from one of the CPUs that
* are mapped to it.
+ *
+ * There are at least two related races now between setting
+ * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
+ * __blk_mq_run_hw_queue():
+ *
+ * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
+ * but later it becomes online; in that case this warning
+ * is harmless
+ *
+ * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
+ * but later it becomes offline; then the warning can't be
+ * triggered, and we depend on the blk-mq timeout handler to
+ * handle requests dispatched to this hctx
*/
- WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
- cpu_online(hctx->next_cpu));
+ if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+ cpu_online(hctx->next_cpu))
+ printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
+ raw_smp_processor_id(),
+ cpumask_empty(hctx->cpumask) ? "inactive" : "active");
/*
* We can't run the queue inline with ints disabled. Ensure that
--
2.9.5
* Re: [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly
2018-01-17 12:34 ` [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly Ming Lei
@ 2018-01-17 15:45 ` Jens Axboe
2018-01-17 16:14 ` Ming Lei
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2018-01-17 15:45 UTC (permalink / raw)
To: Ming Lei, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Christian Borntraeger,
Stefan Haberland, Christoph Hellwig
On 1/17/18 5:34 AM, Ming Lei wrote:
> When hctx->next_cpu is set from the online CPUs in hctx->cpumask, there
> is a race in which hctx->next_cpu may be set to a value >= nr_cpu_ids,
> which in turn breaks the workqueue.
>
> The race is triggered when one CPU is going DEAD: blk_mq_hctx_notify_dead()
> is called to dispatch requests from the DEAD CPU's context, but at that
> point the DEAD CPU has already been cleared from 'cpu_online_mask', so all
> CPUs in hctx->cpumask may be offline, causing hctx->next_cpu to be set to
> a bad value.
>
> This patch deals with the issue by re-selecting the next CPU, and making
> sure it is set correctly.
>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
> Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
> Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 31 +++++++++++++++++++++++++++++--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c376d1b6309a..dc4066d28323 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1416,21 +1416,48 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> */
> static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> {
> + bool tried = false;
> +
> if (hctx->queue->nr_hw_queues == 1)
> return WORK_CPU_UNBOUND;
>
> if (--hctx->next_cpu_batch <= 0) {
> int next_cpu;
> -
> +select_cpu:
> next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> cpu_online_mask);
> if (next_cpu >= nr_cpu_ids)
> next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
>
> - hctx->next_cpu = next_cpu;
> + /*
> + * No online CPU is found here. This may happen when running
> + * from blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> + * is set correctly to avoid breaking the workqueue.
> + */
> + if (next_cpu >= nr_cpu_ids)
> + hctx->next_cpu = cpumask_first(hctx->cpumask);
> + else
> + hctx->next_cpu = next_cpu;
> hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
Since this should _only_ happen from the offline notification, why don't
we move the reset/update logic in that path and out of the hot path?
--
Jens Axboe
* Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
2018-01-17 12:34 ` [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk Ming Lei
@ 2018-01-17 15:46 ` Jens Axboe
2018-01-17 16:05 ` Christian Borntraeger
2018-01-17 16:17 ` Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: Jens Axboe @ 2018-01-17 15:46 UTC (permalink / raw)
To: Ming Lei, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Christian Borntraeger,
Stefan Haberland, Christoph Hellwig
On 1/17/18 5:34 AM, Ming Lei wrote:
> We know this WARN_ON is harmless, and the stack trace isn't useful either,
> so convert it to printk() to avoid confusing people.
I disagree, it is useful to know the exact path it happened from,
in case it's a valid warning. It could be an inline run and
we screwed up the logic, or it could be from a workqueue and
the reason would be entirely different.
--
Jens Axboe
* Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
2018-01-17 15:46 ` Jens Axboe
@ 2018-01-17 16:05 ` Christian Borntraeger
2018-01-17 16:07 ` Jens Axboe
2018-01-17 16:17 ` Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Christian Borntraeger @ 2018-01-17 16:05 UTC (permalink / raw)
To: Jens Axboe, Ming Lei, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Stefan Haberland,
Christoph Hellwig
On 01/17/2018 04:46 PM, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
>> We know this WARN_ON is harmless, and the stack trace isn't useful either,
>> so convert it to printk() to avoid confusing people.
>
> I disagree, it is useful to know the exact path it happened from,
> in case it's a valid warning. It could be an inline run and
> we screwed up the logic, or it could be from a workqueue and
> the reason would be entirely different.
Then add a dump_stack or whatever, but WARN_ON does have fatal effects
for some setups. If it can happen then WARN_ON is just wrong.
* Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
2018-01-17 16:05 ` Christian Borntraeger
@ 2018-01-17 16:07 ` Jens Axboe
0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2018-01-17 16:07 UTC (permalink / raw)
To: Christian Borntraeger, Ming Lei, linux-block, Thomas Gleixner
Cc: Christoph Hellwig, jianchao . wang, Stefan Haberland,
Christoph Hellwig
On 1/17/18 9:05 AM, Christian Borntraeger wrote:
>
>
> On 01/17/2018 04:46 PM, Jens Axboe wrote:
>> On 1/17/18 5:34 AM, Ming Lei wrote:
>>> We know this WARN_ON is harmless, and the stack trace isn't useful either,
>>> so convert it to printk() to avoid confusing people.
>>
>> I disagree, it is useful to know the exact path it happened from,
>> in case it's a valid warning. It could be an inline run and
>> we screwed up the logic, or it could be from a workqueue and
>> the reason would be entirely different.
>
> Then add a dump_stack or whatever, but WARN_ON does have fatal effects
> for some setups. If it can happen then WARN_ON is just wrong.
dump_stack() is fine - and the intent is for it to never happen, it
would be nice to close those holes so we're only catching cases that
are due to bad code.
--
Jens Axboe
* Re: [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly
2018-01-17 15:45 ` Jens Axboe
@ 2018-01-17 16:14 ` Ming Lei
0 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2018-01-17 16:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Thomas Gleixner, Christoph Hellwig, jianchao . wang,
Christian Borntraeger, Stefan Haberland, Christoph Hellwig
On Wed, Jan 17, 2018 at 08:45:44AM -0700, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
> > When hctx->next_cpu is set from the online CPUs in hctx->cpumask, there
> > is a race in which hctx->next_cpu may be set to a value >= nr_cpu_ids,
> > which in turn breaks the workqueue.
> >
> > The race is triggered when one CPU is going DEAD: blk_mq_hctx_notify_dead()
> > is called to dispatch requests from the DEAD CPU's context, but at that
> > point the DEAD CPU has already been cleared from 'cpu_online_mask', so all
> > CPUs in hctx->cpumask may be offline, causing hctx->next_cpu to be set to
> > a bad value.
> >
> > This patch deals with the issue by re-selecting the next CPU, and making
> > sure it is set correctly.
> >
> > Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> > Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
> > Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
> > Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > block/blk-mq.c | 31 +++++++++++++++++++++++++++++--
> > 1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c376d1b6309a..dc4066d28323 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1416,21 +1416,48 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> > */
> > static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> > {
> > + bool tried = false;
> > +
> > if (hctx->queue->nr_hw_queues == 1)
> > return WORK_CPU_UNBOUND;
> >
> > if (--hctx->next_cpu_batch <= 0) {
> > int next_cpu;
> > -
> > +select_cpu:
> > next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> > cpu_online_mask);
> > if (next_cpu >= nr_cpu_ids)
> > next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
> >
> > - hctx->next_cpu = next_cpu;
> > + /*
> > + * No online CPU is found here. This may happen when running
> > + * from blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> > + * is set correctly to avoid breaking the workqueue.
> > + */
> > + if (next_cpu >= nr_cpu_ids)
> > + hctx->next_cpu = cpumask_first(hctx->cpumask);
> > + else
> > + hctx->next_cpu = next_cpu;
> > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>
> Since this should _only_ happen from the offline notification, why don't
> we move the reset/update logic in that path and out of the hot path?
This way is clean, and the only cost introduced in the hot path is the
following two 'if' checks:
	if (next_cpu >= nr_cpu_ids)
and
	if (!cpu_online(hctx->next_cpu))
Now I think this can be triggered not only in the CPU dead path, but also
during a normal queue run; it is just easiest to trigger in the CPU dead
path, as in Jianchao's test:
- blk_mq_delay_run_hw_queue() is called from CPU B, and finds the queue
should be run on other CPUs
- then blk_mq_hctx_next_cpu() is called, and we have to check whether
the destination CPU (the computed hctx->next_cpu) is still online.
So the check is required in the hot path too, but given the cost is
small enough, it shouldn't be a big deal.
Otherwise we would have to introduce a CPU hotplug handler to update
hctx->cpumask, which should be doable with the help of RCU. We can work
in that direction if you don't mind.
Thanks,
Ming
* Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk
2018-01-17 15:46 ` Jens Axboe
2018-01-17 16:05 ` Christian Borntraeger
@ 2018-01-17 16:17 ` Ming Lei
1 sibling, 0 replies; 9+ messages in thread
From: Ming Lei @ 2018-01-17 16:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Thomas Gleixner, Christoph Hellwig, jianchao . wang,
Christian Borntraeger, Stefan Haberland, Christoph Hellwig
On Wed, Jan 17, 2018 at 08:46:40AM -0700, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
> > We know this WARN_ON is harmless and the stack trace isn't useful too,
> > so convert it to printk(), and avoid to confuse people.
>
> I disagree, it is useful to know the exact path it happened from,
> in case it's a valid warning. It could be an inline run and
An inline run is only possible in theory, since the window is too
small.
> we screwed up the logic, or it could be from a workqueue and
> the reason would be entirely different.
OK, since you don't agree, I will just add a comment about the races
we found.
--
Ming