* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
       [not found] ` <8c8efce8-ea02-0a9e-8369-44c885f4731d@oracle.com>
@ 2018-01-17 6:22   ` Ming Lei
  2018-01-17 8:09     ` jianchao.wang
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2018-01-17 6:22 UTC (permalink / raw)

Hi Jianchao,

On Wed, Jan 17, 2018@01:24:23PM +0800, jianchao.wang wrote:
> Hi ming
>
> Thanks for your kindly response.
>
> On 01/17/2018 11:52 AM, Ming Lei wrote:
> >> It is here.
> >> __blk_mq_run_hw_queue()
> >> ....
> >> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> >> 	cpu_online(hctx->next_cpu));
> > I think this warning is triggered after the CPU of hctx->next_cpu becomes
> > online again, and it should have been dealt with by the following trick,
> > and this patch is against the previous one, please test it and see if
> > the warning can be fixed.
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 23f0f3ddffcf..0620ccb65e4e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> >  			tried = true;
> >  			goto select_cpu;
> >  		}
> > +
> > +		/* handle after this CPU of hctx->next_cpu becomes online again */
> > +		hctx->next_cpu_batch = 1;
> >  		return WORK_CPU_UNBOUND;
> >  	}
> >  	return hctx->next_cpu;
> >
>
> The WARNING still could be triggered.
>
> [  282.194532] WARNING: CPU: 0 PID: 222 at /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 __blk_mq_run_hw_queue+0x92/0xa0
> [  282.194534] Modules linked in: ....
> [  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: G W 4.15.0-rc7+ #17
>

This warning can't be removed completely, for example, the CPU figured
in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
following call returns and before __blk_mq_run_hw_queue() is scheduled
to run.

	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))

Just be curious how you trigger this issue? And is it triggered in CPU
hotplug stress test? Or in a normal use case?

> >> ....
> >>
> >> To eliminate this risk totally, we could blk_mq_hctx_next_cpu return the cpu even if the cpu is offlined and modify the cpu_online above to cpu_active.
> >> The kworkers of the per-cpu pool must have be migrated back when the cpu is set active.
> >> But there seems to be issues in DASD as your previous comment.
> > Yes, we can't break DASD.
> >
> >> That is the original version of this patch, and both Christian and Stefan
> >> reported that system can't boot from DASD in this way[2], and I changed
> >> to AND with cpu_online_mask, then their system can boot well
> >> On the other hand, there is also risk in
> >>
> >> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >>  		blk_queue_exit(q);
> >>  		return ERR_PTR(-EXDEV);
> >>  	}
> >> -	cpu = cpumask_first(alloc_data.hctx->cpumask);
> >> +	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> >>  	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> >>
> >> what if the cpus in alloc_data.hctx->cpumask are all offlined ?
> > This one is crazy, and is used by NVMe only, it should be fine if
> > the passed 'hctx_idx' is retrieved by the current running CPU, such
> > as the way of blk_mq_map_queue(). But if not, bad thing may happen.
> Even if retrieved by current running cpu, the task still could be migrated away when the cpu is offlined.
> I just checked NVMe code, looks only nvmf_connect_io_queue() passes one specific 'qid' and calls blk_mq_alloc_request_hctx(); and for others, NVME_QID_ANY is passed, then blk_mq_alloc_request() is called in nvme_alloc_request().

CC NVMe list and guys.

Hello James, Keith, Christoph and Sagi,

It looks like nvmf_connect_io_queue() is run in the following fragile way:

	for (i = 1; i < ctrl->ctrl.queue_count; i++) {
		...
		nvmf_connect_io_queue(&ctrl->ctrl, i);
		...
	}

The hardware queue idx is passed to nvmf_connect_io_queue() directly, and
finally blk_mq_alloc_request_hctx() is called to allocate a request for that
queue, but all CPUs mapped to this hw queue may become offline while
blk_mq_alloc_request_hctx() is running, so what is the supposed way to
handle this situation?

Is it fine to simply return a failure from blk_mq_alloc_request_hctx() in
this situation? But the CPU may become online again later...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 8+ messages in thread
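As a side note on the question above, a minimal sketch of the kind of defensive check being discussed for blk_mq_alloc_request_hctx() might look like the following. This is illustrative only, not a patch posted in this thread, and the helper name is hypothetical:

	/*
	 * Hypothetical sketch (not a posted patch): pick a software ctx for
	 * the hctx only if at least one of its mapped CPUs is currently
	 * online; otherwise let the caller fail the allocation.
	 */
	static struct blk_mq_ctx *hctx_pick_online_ctx(struct request_queue *q,
						       struct blk_mq_hw_ctx *hctx)
	{
		unsigned int cpu;

		cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			return NULL;	/* every CPU mapped to this hctx is offline */

		return __blk_mq_get_ctx(q, cpu);
	}

A caller could treat a NULL return like an invalid hctx, i.e. blk_queue_exit() and return ERR_PTR(-EXDEV), which is exactly the "return failure" option asked about above; whether that is acceptable when the CPU may come back online later is the open question.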
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 6:22 ` [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU Ming Lei
@ 2018-01-17 8:09   ` jianchao.wang
  2018-01-17 9:57     ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: jianchao.wang @ 2018-01-17 8:09 UTC (permalink / raw)

Hi ming

Thanks for your kindly response.

On 01/17/2018 02:22 PM, Ming Lei wrote:
> This warning can't be removed completely, for example, the CPU figured
> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> following call returns and before __blk_mq_run_hw_queue() is scheduled
> to run.
>
> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))

We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
There is a big gap between cpu_online and cpu_active. rebind_workers is also between them.

>
> Just be curious how you trigger this issue? And is it triggered in CPU
> hotplug stress test? Or in a normal use case?

In fact, this is my own investigation about whether the .queue_rq of one hardware queue could be executed on
the cpu where it is not mapped. Finally, I found this hole during cpu hotplug.
I did the test on an NVMe device which has 1-to-1 mapping between cpu and hctx.
- A special patch that could hold some requests on ctx->rq_list through .get_budget
- A script that issues IOs with fio
- A script that onlines/offlines the cpus continuously

At first, there was just the warning above. Then after this patch was introduced, a panic came up.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 8+ messages in thread
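For reference, the cpu_active() idea above would amount to something like the sketch below. It is illustrative only, not a patch from this thread; the helper is hypothetical and simply restates the existing WARN_ON check with cpu_active() in place of cpu_online(), relying on the point made above that rebind_workers runs in the window between cpu_online and cpu_active:

	/*
	 * Illustrative sketch, not a posted patch: check cpu_active() instead
	 * of cpu_online(), since a CPU is only marked active after the
	 * per-cpu kworkers have been rebound.
	 */
	static void blk_mq_hctx_check_cpu(struct blk_mq_hw_ctx *hctx)
	{
		WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
			cpu_active(hctx->next_cpu));
	}

Whether that is enough is debatable, since the CPU's state can still change between queuing the work and running it, as discussed in the follow-ups.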
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 8:09 ` jianchao.wang
@ 2018-01-17 9:57   ` Ming Lei
  2018-01-17 10:07     ` Christian Borntraeger
  2018-01-19 3:05     ` jianchao.wang
  0 siblings, 2 replies; 8+ messages in thread
From: Ming Lei @ 2018-01-17 9:57 UTC (permalink / raw)

Hi Jianchao,

On Wed, Jan 17, 2018@04:09:11PM +0800, jianchao.wang wrote:
> Hi ming
>
> Thanks for your kindly response.
>
> On 01/17/2018 02:22 PM, Ming Lei wrote:
> > This warning can't be removed completely, for example, the CPU figured
> > in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> > following call returns and before __blk_mq_run_hw_queue() is scheduled
> > to run.
> >
> > kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))
> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> There is a big gap between cpu_online and cpu_active. rebind_workers is also between them.

This warning is harmless, also you can't reproduce it without help of your
special patch, I guess, :-) So the window shouldn't be a big deal.

But it can be a problem with the delay (msecs_to_jiffies(msecs)) passed to
kblockd_mod_delayed_work_on(), because during that period:

1) hctx->next_cpu can become online from offline before __blk_mq_run_hw_queue
is run; your warning is triggered, but it is harmless

2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
is run, there isn't warning, but once the IO is submitted to hardware,
after it is completed, how does the HBA/hw queue notify CPU since CPUs
assigned to this hw queue(irq vector) are offline? blk-mq's timeout
handler may cover that, but looks too tricky.

> >
> > Just be curious how you trigger this issue? And is it triggered in CPU
> > hotplug stress test? Or in a normal use case?
> In fact, this is my own investigation about whether the .queue_rq of one hardware queue could be executed on
> the cpu where it is not mapped. Finally, I found this hole during cpu hotplug.
> I did the test on an NVMe device which has 1-to-1 mapping between cpu and hctx.
> - A special patch that could hold some requests on ctx->rq_list through .get_budget
> - A script that issues IOs with fio
> - A script that onlines/offlines the cpus continuously

Thanks for sharing your reproduction approach.

Without a handler for CPU hotplug, it isn't easy to avoid the warning
completely in __blk_mq_run_hw_queue().

> At first, there was just the warning above. Then after this patch was introduced, a panic came up.

We have to fix the panic, so I will post the patch you tested in this thread.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 8+ messages in thread
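To make the window described above easier to see, here is a rough, simplified sketch of the next-CPU selection that the diff quoted earlier in the thread modifies. It is not the exact upstream blk_mq_hctx_next_cpu() (for one thing, the next_cpu_batch handling is omitted); the point is only that the CPU is chosen here, while the work item runs after the kblockd delay, and the CPU can go offline or come back online in between:

	/*
	 * Rough sketch only, simplified from the function discussed above;
	 * the real blk_mq_hctx_next_cpu() also handles next_cpu_batch.
	 */
	static int next_cpu_sketch(struct blk_mq_hw_ctx *hctx)
	{
		bool tried = false;
		int cpu;

		if (hctx->queue->nr_hw_queues == 1)
			return WORK_CPU_UNBOUND;

	select_cpu:
		/* round-robin to the next online CPU mapped to this hctx */
		cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask, cpu_online_mask);
		if (cpu >= nr_cpu_ids) {
			if (!tried) {
				/* wrap around and scan the mask once more */
				hctx->next_cpu = -1;
				tried = true;
				goto select_cpu;
			}
			/* every CPU mapped to this hctx looks offline right now */
			return WORK_CPU_UNBOUND;
		}

		hctx->next_cpu = cpu;
		return cpu;
	}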
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 9:57 ` Ming Lei
@ 2018-01-17 10:07   ` Christian Borntraeger
  2018-01-17 10:14     ` Christian Borntraeger
  2018-01-17 10:17     ` Ming Lei
  1 sibling, 2 replies; 8+ messages in thread
From: Christian Borntraeger @ 2018-01-17 10:07 UTC (permalink / raw)

On 01/17/2018 10:57 AM, Ming Lei wrote:
> Hi Jianchao,
>
> On Wed, Jan 17, 2018@04:09:11PM +0800, jianchao.wang wrote:
>> Hi ming
>>
>> Thanks for your kindly response.
>>
>> On 01/17/2018 02:22 PM, Ming Lei wrote:
>>> This warning can't be removed completely, for example, the CPU figured
>>> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
>>> following call returns and before __blk_mq_run_hw_queue() is scheduled
>>> to run.
>>>
>>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))
>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>> There is a big gap between cpu_online and cpu_active. rebind_workers is also between them.
>
> This warning is harmless, also you can't reproduce it without help of your
> special patch, I guess, :-) So the window shouldn't be a big deal.

FWIW, every WARN_ON is problematic since there are people running with panic_on_warn.
If a condition can happen we should not use WARN_ON but something else.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 10:07 ` Christian Borntraeger
@ 2018-01-17 10:14   ` Christian Borntraeger
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Borntraeger @ 2018-01-17 10:14 UTC (permalink / raw)

On 01/17/2018 11:07 AM, Christian Borntraeger wrote:
>
>
> On 01/17/2018 10:57 AM, Ming Lei wrote:
>> Hi Jianchao,
>>
>> On Wed, Jan 17, 2018@04:09:11PM +0800, jianchao.wang wrote:
>>> Hi ming
>>>
>>> Thanks for your kindly response.
>>>
>>> On 01/17/2018 02:22 PM, Ming Lei wrote:
>>>> This warning can't be removed completely, for example, the CPU figured
>>>> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
>>>> following call returns and before __blk_mq_run_hw_queue() is scheduled
>>>> to run.
>>>>
>>>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))
>>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>>> There is a big gap between cpu_online and cpu_active. rebind_workers is also between them.
>>
>> This warning is harmless, also you can't reproduce it without help of your
>> special patch, I guess, :-) So the window shouldn't be a big deal.
>
> FWIW, every WARN_ON is problematic since there are people running with panic_on_warn.

To make it more clear: every WARN_ON that can happen in real life without
actually being an error is problematic.

> If a condition can happen we should not use WARN_ON but something else.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 10:07 ` Christian Borntraeger
  2018-01-17 10:14   ` Christian Borntraeger
@ 2018-01-17 10:17   ` Ming Lei
  1 sibling, 0 replies; 8+ messages in thread
From: Ming Lei @ 2018-01-17 10:17 UTC (permalink / raw)

On Wed, Jan 17, 2018@11:07:48AM +0100, Christian Borntraeger wrote:
>
>
> On 01/17/2018 10:57 AM, Ming Lei wrote:
> > Hi Jianchao,
> >
> > On Wed, Jan 17, 2018@04:09:11PM +0800, jianchao.wang wrote:
> >> Hi ming
> >>
> >> Thanks for your kindly response.
> >>
> >> On 01/17/2018 02:22 PM, Ming Lei wrote:
> >>> This warning can't be removed completely, for example, the CPU figured
> >>> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> >>> following call returns and before __blk_mq_run_hw_queue() is scheduled
> >>> to run.
> >>>
> >>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, msecs_to_jiffies(msecs))
> >> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> >> There is a big gap between cpu_online and cpu_active. rebind_workers is also between them.
> >
> > This warning is harmless, also you can't reproduce it without help of your
> > special patch, I guess, :-) So the window shouldn't be a big deal.
>
> FWIW, every WARN_ON is problematic since there are people running with panic_on_warn.
> If a condition can happen we should not use WARN_ON but something else.

Agree, printk() should be fine, IMO.

--
Ming

^ permalink raw reply	[flat|nested] 8+ messages in thread
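Concretely, the direction agreed on here could look something like the sketch below. It is illustrative only, not a patch posted in this thread; the function name and message text are made up. The idea is simply to report the unexpected CPU with a rate-limited message instead of WARN_ON(), so machines running with panic_on_warn do not panic on a race that is known to be harmless:

	/*
	 * Illustrative sketch, not a posted patch: downgrade the WARN_ON()
	 * discussed above to a rate-limited warning message.
	 */
	static void blk_mq_report_unmapped_cpu(struct blk_mq_hw_ctx *hctx)
	{
		if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
		    cpu_online(hctx->next_cpu))
			printk_ratelimited(KERN_WARNING
				"blk-mq: hw queue %u run on unmapped CPU %d\n",
				hctx->queue_num, raw_smp_processor_id());
	}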
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-17 9:57 ` Ming Lei
  2018-01-17 10:07   ` Christian Borntraeger
@ 2018-01-19 3:05   ` jianchao.wang
  2018-01-26 9:31     ` Ming Lei
  1 sibling, 1 reply; 8+ messages in thread
From: jianchao.wang @ 2018-01-19 3:05 UTC (permalink / raw)

Hi ming

Sorry for delayed report this.

On 01/17/2018 05:57 PM, Ming Lei wrote:
> 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> is run, there isn't warning, but once the IO is submitted to hardware,
> after it is completed, how does the HBA/hw queue notify CPU since CPUs
> assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> handler may cover that, but looks too tricky.

In theory, the irq affinity will be migrated to other cpu. This is done by
fixup_irqs() in the context of stop_machine.

However, in my test, I found this log:

[  267.161043] do_IRQ: 7.33 No irq handler for vector

The 33 is the vector used by nvme cq.
The irq seems to be missed and sometimes IO hang occurred.

It does not happen every time; I think that may be due to nvme_process_cq() in
nvme_queue_rq(). I added a dump_stack() behind the error log and got the following:

[  267.161043] do_IRQ: 7.33 No irq handler for vector migration/7
[  267.161045] CPU: 7 PID: 52 Comm: migration/7 Not tainted 4.15.0-rc7+ #27
[  267.161045] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  267.161046] Call Trace:
[  267.161047]  <IRQ>
[  267.161052]  dump_stack+0x7c/0xb5
[  267.161054]  do_IRQ+0xb9/0xf0
[  267.161056]  common_interrupt+0xa2/0xa2
[  267.161057]  </IRQ>
[  267.161059] RIP: 0010:multi_cpu_stop+0xb0/0x120
[  267.161060] RSP: 0018:ffffbb6c81af7e70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffde
[  267.161061] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
[  267.161062] RDX: 0000000000000006 RSI: ffffffff898c4591 RDI: 0000000000000202
[  267.161063] RBP: ffffbb6c826e7c88 R08: ffff991abc1256bc R09: 0000000000000005
[  267.161063] R10: ffffbb6c81af7db8 R11: ffffffff89c91d20 R12: 0000000000000001
[  267.161064] R13: ffffbb6c826e7cac R14: 0000000000000003 R15: 0000000000000000
[  267.161067]  ? cpu_stop_queue_work+0x90/0x90
[  267.161068]  cpu_stopper_thread+0x83/0x100
[  267.161070]  smpboot_thread_fn+0x161/0x220
[  267.161072]  kthread+0xf5/0x130
[  267.161073]  ? sort_range+0x20/0x20
[  267.161074]  ? kthread_associate_blkcg+0xe0/0xe0
[  267.161076]  ret_from_fork+0x24/0x30

The irq occurred just after interrupts were re-enabled in multi_cpu_stop.
0xffffffff8112d655 is in multi_cpu_stop (/home/will/u04/source_code/linux-block/kernel/stop_machine.c:223).
218			 */
219			touch_nmi_watchdog();
220		}
221	} while (curstate != MULTI_STOP_EXIT);
222
223	local_irq_restore(flags);
224	return err;
225 }

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU
  2018-01-19 3:05 ` jianchao.wang
@ 2018-01-26 9:31   ` Ming Lei
  0 siblings, 0 replies; 8+ messages in thread
From: Ming Lei @ 2018-01-26 9:31 UTC (permalink / raw)

Hi Jianchao,

On Fri, Jan 19, 2018@11:05:35AM +0800, jianchao.wang wrote:
> Hi ming
>
> Sorry for delayed report this.
>
> On 01/17/2018 05:57 PM, Ming Lei wrote:
> > 2) hctx->next_cpu can become offline from online before __blk_mq_run_hw_queue
> > is run, there isn't warning, but once the IO is submitted to hardware,
> > after it is completed, how does the HBA/hw queue notify CPU since CPUs
> > assigned to this hw queue(irq vector) are offline? blk-mq's timeout
> > handler may cover that, but looks too tricky.
>
> In theory, the irq affinity will be migrated to other cpu. This is done by

Yes, but the other CPU has to belong to this irq's affinity, and if all CPUs
in the irq's affinity are DEAD, this irq vector will be shut down. If there
is in-flight IO (or soon will be), then the completions for those IOs won't
be delivered to any CPU, and for now it seems we depend on the queue's
timeout handler to handle them.

> fixup_irqs() in the context of stop_machine.
> However, in my test, I found this log:
>
> [  267.161043] do_IRQ: 7.33 No irq handler for vector
>
> The 33 is the vector used by nvme cq.
> The irq seems to be missed and sometimes IO hang occurred.

As I mentioned above, it shouldn't be strange to see this in a CPU
offline/online stress test.

--
Ming

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread

Thread overview: 8+ messages (newest: 2018-01-26 9:31 UTC)
[not found] <20180112025306.28004-1-ming.lei@redhat.com>
[not found] ` <20180112025306.28004-3-ming.lei@redhat.com>
[not found] ` <0d36c16b-cb4b-6088-fdf3-2fe5d8f33cd7@oracle.com>
[not found] ` <20180116121010.GA26429@ming.t460p>
[not found] ` <7c24e321-2d3b-cdec-699a-f58c34300aa9@oracle.com>
[not found] ` <20180116153248.GA3018@ming.t460p>
[not found] ` <7f5bad86-febc-06fc-67c0-393777d172e4@oracle.com>
[not found] ` <20180117035159.GA9487@ming.t460p>
[not found] ` <8c8efce8-ea02-0a9e-8369-44c885f4731d@oracle.com>
2018-01-17 6:22 ` [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU Ming Lei
2018-01-17 8:09 ` jianchao.wang
2018-01-17 9:57 ` Ming Lei
2018-01-17 10:07 ` Christian Borntraeger
2018-01-17 10:14 ` Christian Borntraeger
2018-01-17 10:17 ` Ming Lei
2018-01-19 3:05 ` jianchao.wang
2018-01-26 9:31 ` Ming Lei