* setting nvme irq per cpu affinity in device driver
@ 2015-09-02 10:26 김경산
2015-09-02 14:05 ` Christoph Hellwig
2015-09-02 19:07 ` Keith Busch
0 siblings, 2 replies; 11+ messages in thread
From: 김경산 @ 2015-09-02 10:26 UTC (permalink / raw)
Hello.
Recently, we've experienced two bad conditions in the use of nvme ssd.
First one was, soft-lockup kernel warning has continually displayed when we
run fio test with high jobs(>32) on a SMP system(usually more than 32 CPUs).
Second one was, scalability has significantly decreased under multi SSD
device environment.
The more we use nvme SSDs in our test, the lower scalability has shown.
Those two were critical issue for us as it hinders to archive high IOPS.
We've investigated to find out the cause of the problem and we've found it.
The root cause was that the majority of interrupt handling has been
processed by a CPU, mostly by CPU0,
unlikely our expectation that an interrupt would be also handled by the
same CPU that handled SQ submission.
When we balanced IRQ processing over multi CPUs, both of the phenomenon has
disappeared and significantly improved.
Actually, in current status, device driver already tries to set
affinity_hint for IRQs during Q initialization.
But in our tests, it does not guarantee the CPU distribution on system-wide
even with irqbalance daemon working, failing to resolve above issues.
Later I've thought that setting affinity by shell script, but come to know
that it has limitation to make it always work well.
So, we become thought that a clear way to solve the problems is setting the
nvme irq affinity from device driver by itself.
With this modification, we could archive high scalability under large IO
requests with multi SSD devices.
We think this can help those who want to expect high IOPS as we did.
As a result, we suggest the patch providing a new module option,
use_set_irq_affinity(default=0).
When it is enabled(=1), insmod nvme.ko use_set_irq_affinity=1,
nvme IRQ per CPU matching is proceeded in the process of Q initialization.
It finally effects on /proc/irq/$IRQNO/smp_affinity.
Of course, system administrator can change it later on purpose for some
reason.
We hope the snippet merge on mainstream. Please review the modification.
It is created from 4.2-rc6 nvme-core.c
--- nvme-core.c.426.org 2015-09-02 23:54:16.479746463 +0900
+++ nvme-core.c 2015-09-03 01:10:48.944251952 +0900
@@ -63,6 +63,14 @@
module_param(shutdown_timeout, byte, 0644);
MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller
shutdown");
+static int use_set_irq_affinity;
+module_param(use_set_irq_affinity, int, 0);
+MODULE_PARM_DESC(use_set_irq_affinity, "set irq affinity to assign CPU per
IRQ evenly");
+
+static int interrupt_coalescing_param;
+module_param(interrupt_coalescing_param, int, 0);
+MODULE_PARM_DESC(interrupt_coalescing_param, "interrupt coalescing
param(time/threshold : 0x00~0xFF");
+
static int nvme_major;
module_param(nvme_major, int, 0);
@@ -249,6 +257,29 @@
blk_mq_start_request(blk_mq_rq_from_pdu(cmd));
}
+static int nvme_set_irq_affinity(unsigned int irq, const struct cpumask
*mask, bool force)
+{
+ int ret;
+ unsigned long flags;
+ struct irq_desc *desc;
+ struct irq_data *data;
+ struct irq_chip *chip;
+
+ desc = irq_to_desc(irq);
+ if (!desc)
+ return -EINVAL;
+ data = irq_desc_get_irq_data(desc);
+ if(!data)
+ return -EINVAL;
+ chip = irq_data_get_irq_chip(data);
+ if(!chip)
+ return -EINVAL;
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ ret = chip->irq_set_affinity(data, mask, force);
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ return ret;
+}
+
static void *iod_get_private(struct nvme_iod *iod)
{
return (void *) (iod->private & ~0x1UL);
@@ -2839,13 +2866,19 @@
int i;
for (i = 0; i < dev->online_queues; i++) {
+ int cpu_id;
nvmeq = dev->queues[i];
- if (!nvmeq->tags || !(*nvmeq->tags))
+ if (!nvmeq)
continue;
- irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
- blk_mq_tags_cpumask(*nvmeq->tags));
+ cpu_id = (i <= 1) ? 0 : i-1;
+ irq_set_affinity_hint(dev->entry[nvmeq-
>cq_vector].vector,get_cpu_mask(cpu_id));
+ if(use_set_irq_affinity){
+ dev_info(dev->dev,"set affinity(IRQ%d-
>CPU%d)\n",dev->entry[nvmeq->cq_vector].vector,cpu_id);
+ nvme_set_irq_affinity(dev->entry[nvmeq-
>cq_vector].vector,get_cpu_mask(cpu_id),false);
+ }
+
}
}
^ permalink raw reply [flat|nested] 11+ messages in thread* setting nvme irq per cpu affinity in device driver
2015-09-02 10:26 setting nvme irq per cpu affinity in device driver 김경산
@ 2015-09-02 14:05 ` Christoph Hellwig
2015-09-03 5:01 ` 김경산
2015-09-06 8:06 ` 김경산
2015-09-02 19:07 ` Keith Busch
1 sibling, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2015-09-02 14:05 UTC (permalink / raw)
We'll need a proper API in the interrupt subsystem to set the affinity
instead of poking directly into the internals. The irq subsystem
maintainer already indicated he's fine with adding suh an API
in principle. Please go ahead and propose something similar to your
implementation, just as an exported API in kernel/irq instead of inside
the NVMe driver.
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-02 14:05 ` Christoph Hellwig
@ 2015-09-03 5:01 ` 김경산
2015-09-06 8:06 ` 김경산
1 sibling, 0 replies; 11+ messages in thread
From: 김경산 @ 2015-09-03 5:01 UTC (permalink / raw)
Hi Christoph Hellwig,
Thank you for your comment.
I've fixed to call kernel API, irq_set_affinity(), assuming it will be
provided again.
Let me ask one thing.
My current approach is that assigning a CPU to an IRQ for a CQ interrupt by
retrieving cpu id with get_cpu_mask().
Do you think blk_mq_tags_cpumask() might work better?
module_param(shutdown_timeout, byte, 0644);
MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller
shutdown");
+static int use_set_irq_affinity;
+module_param(use_set_irq_affinity, int, 0);
+MODULE_PARM_DESC(use_set_irq_affinity, "set irq affinity to assign CPU per
IRQ evenly");
+
static int nvme_major;
module_param(nvme_major, int, 0);
@@ -249,6 +253,14 @@
blk_mq_start_request(blk_mq_rq_from_pdu(cmd));
}
+static nvme_set_irq_affinity(unsigned int irq, const struct cpumask *mask)
+{
+ int ret = 0;
+ /* call kernel API when provided */
+ /* ret = irq_set_affinity() */
+ return ret;
+}
+
static void *iod_get_private(struct nvme_iod *iod)
{
return (void *) (iod->private & ~0x1UL);
@@ -2839,13 +2851,19 @@
int i;
for (i = 0; i < dev->online_queues; i++) {
+ int cpu_id;
nvmeq = dev->queues[i];
- if (!nvmeq->tags || !(*nvmeq->tags))
+ if (!nvmeq)
continue;
- irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
- blk_mq_tags_cpumask(*nvmeq->tags));
+ cpu_id = (i <= 1) ? 0 : i-1;
+ irq_set_affinity_hint(dev->entry[nvmeq-
>cq_vector].vector,get_cpu_mask(cpu_id));
+ if(use_set_irq_affinity){
+ dev_info(dev->dev,"set affinity(IRQ%d-
>CPU%d)\n",dev->entry[nvmeq->cq_vector].vector,cpu_id);
+ nvme_set_irq_affinity(dev->entry[nvmeq-
>cq_vector].vector,get_cpu_mask(cpu_id),false);
+ }
+
}
}
-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org]
Sent: Wednesday, September 02, 2015 11:05 PM
To: ??????
Cc: Linux-nvme at lists.infradead.org
Subject: Re: setting nvme irq per cpu affinity in device driver
We'll need a proper API in the interrupt subsystem to set the affinity
instead of poking directly into the internals. The irq subsystem
maintainer already indicated he's fine with adding suh an API in principle.
Please go ahead and propose something similar to your implementation, just
as an exported API in kernel/irq instead of inside the NVMe driver.
^ permalink raw reply [flat|nested] 11+ messages in thread* setting nvme irq per cpu affinity in device driver
2015-09-02 14:05 ` Christoph Hellwig
2015-09-03 5:01 ` 김경산
@ 2015-09-06 8:06 ` 김경산
2015-09-07 17:54 ` 'Christoph Hellwig'
2015-09-08 14:47 ` Keith Busch
1 sibling, 2 replies; 11+ messages in thread
From: 김경산 @ 2015-09-06 8:06 UTC (permalink / raw)
Hi Christoph Hellwig,
I'd like to know the plan to provide the API from irq sybsystem.
Let me kindly ask you to how can I get to know the status.
Do you think should I need to contact to the irq maintainer?
-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf
Of Christoph Hellwig
Sent: Wednesday, September 02, 2015 11:05 PM
To: ??????
Cc: Linux-nvme at lists.infradead.org
Subject: Re: setting nvme irq per cpu affinity in device driver
We'll need a proper API in the interrupt subsystem to set the affinity
instead of poking directly into the internals. The irq subsystem
maintainer already indicated he's fine with adding suh an API in principle.
Please go ahead and propose something similar to your implementation, just
as an exported API in kernel/irq instead of inside the NVMe driver.
_______________________________________________
Linux-nvme mailing list
Linux-nvme at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-06 8:06 ` 김경산
@ 2015-09-07 17:54 ` 'Christoph Hellwig'
2015-09-10 10:25 ` 김경산
2015-09-08 14:47 ` Keith Busch
1 sibling, 1 reply; 11+ messages in thread
From: 'Christoph Hellwig' @ 2015-09-07 17:54 UTC (permalink / raw)
On Sun, Sep 06, 2015@05:06:24PM +0900, ?????? wrote:
> Hi Christoph Hellwig,
>
> I'd like to know the plan to provide the API from irq sybsystem.
> Let me kindly ask you to how can I get to know the status.
> Do you think should I need to contact to the irq maintainer?
The plan in still vague. I'd suggest you kick start the discussion
by submitting a patch that adds the code you suggest into a helper
in kernel/irq/manage.c.
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-07 17:54 ` 'Christoph Hellwig'
@ 2015-09-10 10:25 ` 김경산
0 siblings, 0 replies; 11+ messages in thread
From: 김경산 @ 2015-09-10 10:25 UTC (permalink / raw)
I've confirmed that current irq_set_affinity_hint() implementation has
already been fixed to set affinity internally.
When the patch that Keith Busch has summited merged, I believe we can close
this issue with no more modification in device driver.
Only suggestion is we need to remain somewhere in kernel document, to guide
system administrator to control irqbalance not to overwrite nvme affinity.
/* irq_set_affinity_hint() : manage.c */
int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m) {
unsigned long flags;
struct irq_desc *desc = irq_get_desc_lock(irq, &flags,
IRQ_GET_DESC_CHECK_GLOBAL);
if (!desc)
return -EINVAL;
desc->affinity_hint = m;
irq_put_desc_unlock(desc, flags);
/* set the initial affinity to prevent every interrupt being on
CPU0 */
if (m)
__irq_set_affinity(irq, m, false);
return 0;
}
commit e2e64a932556cdfae455497dbe94a8db151fc9fa
Author: Jesse Brandeburg <jesse.brandeburg at intel.com>
Date: Thu Dec 18 17:22:06 2014 -0800
genirq: Set initial affinity in irq_set_affinity_hint()
Problem:
The default behavior of the kernel is somewhat undesirable as all
requested interrupts end up on CPU0 after registration. A user can
run irqbalance daemon, or can manually configure smp_affinity via the
proc filesystem, but the default affinity of the interrupts for all
devices is always CPU zero, this can cause performance problems or
very heavy cpu use of only one core if not noticed and fixed by the
user.
Solution:
Enable the setting of the initial affinity directly when the driver
sets a hint.
This enabling means that kernel drivers can include an initial
affinity setting for the interrupt, instead of all interrupts starting
out life on CPU0. Of course if irqbalance is still running then the
interrupts will get moved as before.
This function is currently called by drivers in block, crypto,
infiniband, ethernet and scsi trees, but only a handful, so these will
be the devices affected by this change.
-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf
Of 'Christoph Hellwig'
Sent: Tuesday, September 08, 2015 2:54 AM
To: ??????
Cc: 'Christoph Hellwig'; Linux-nvme at lists.infradead.org
Subject: Re: setting nvme irq per cpu affinity in device driver
On Sun, Sep 06, 2015@05:06:24PM +0900, ?????? wrote:
> Hi Christoph Hellwig,
>
> I'd like to know the plan to provide the API from irq sybsystem.
> Let me kindly ask you to how can I get to know the status.
> Do you think should I need to contact to the irq maintainer?
The plan in still vague. I'd suggest you kick start the discussion by
submitting a patch that adds the code you suggest into a helper in
kernel/irq/manage.c.
_______________________________________________
Linux-nvme mailing list
Linux-nvme at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-06 8:06 ` 김경산
2015-09-07 17:54 ` 'Christoph Hellwig'
@ 2015-09-08 14:47 ` Keith Busch
2015-09-09 0:35 ` 김경산
1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2015-09-08 14:47 UTC (permalink / raw)
On Sun, 6 Sep 2015, ??? wrote:
> Hi Christoph Hellwig,
>
> I'd like to know the plan to provide the API from irq sybsystem.
> Let me kindly ask you to how can I get to know the status.
> Do you think should I need to contact to the irq maintainer?
I think it'd be great to propose to the irq subsystem, but I think this
doesn't necessarilly solve the problem. Won't irqbalance still undo
whatever the driver requested?
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-08 14:47 ` Keith Busch
@ 2015-09-09 0:35 ` 김경산
0 siblings, 0 replies; 11+ messages in thread
From: 김경산 @ 2015-09-09 0:35 UTC (permalink / raw)
Actually, That's true.
Even though driver sets affinity during queue initialization, irqbalance
overwrite the setting when it run.
So, My suggestion is remain this feature for those who need to archieve
high IOPS with a direction exposed like something below.
-disable irqbalance
-insmod nvme.ko use_set_irq_affinity=1
-optionally, enable irqbalance again with -i option lists, nvme irq lists
monitoring banned
-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf
Of Keith Busch
Sent: Tuesday, September 08, 2015 11:48 PM
To: ???
Cc: 'Christoph Hellwig'; Linux-nvme at lists.infradead.org
Subject: RE: setting nvme irq per cpu affinity in device driver
On Sun, 6 Sep 2015, ??? wrote:
> Hi Christoph Hellwig,
>
> I'd like to know the plan to provide the API from irq sybsystem.
> Let me kindly ask you to how can I get to know the status.
> Do you think should I need to contact to the irq maintainer?
I think it'd be great to propose to the irq subsystem, but I think this
doesn't necessarilly solve the problem. Won't irqbalance still undo
whatever the driver requested?
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-02 10:26 setting nvme irq per cpu affinity in device driver 김경산
2015-09-02 14:05 ` Christoph Hellwig
@ 2015-09-02 19:07 ` Keith Busch
2015-09-03 0:33 ` 김경산
1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2015-09-02 19:07 UTC (permalink / raw)
On Wed, 2 Sep 2015, ??? wrote:
> Actually, in current status, device driver already tries to set
> affinity_hint for IRQs during Q initialization.
> But in our tests, it does not guarantee the CPU distribution on system-wide
> even with irqbalance daemon working, failing to resolve above issues.
Are you running irqbalance with '--hintpolicy=exact'? It Seems to do
the right thing over here.
^ permalink raw reply [flat|nested] 11+ messages in thread
* setting nvme irq per cpu affinity in device driver
2015-09-02 19:07 ` Keith Busch
@ 2015-09-03 0:33 ` 김경산
2015-09-03 14:14 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: 김경산 @ 2015-09-03 0:33 UTC (permalink / raw)
Hello, Keith Busch.
Thank you for your opinion. Yes, we've already considered the approach.
However, there were two issues regarding the way.
1. In our tests, setting affinity hint is not working on 4.2-rc6 driver.
Please find below code. It is current implementation (4.2-rc6)for setting
affinity hint.
static void nvme_set_irq_hints(struct nvme_dev *dev) {
struct nvme_queue *nvmeq;
int i;
for (i = 0; i < dev->online_queues; i++) {
nvmeq = dev->queues[i];
if (!nvmeq->tags || !(*nvmeq->tags))
continue;
/*kyungsan : <- not reach here from second loop*/
irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
blk_mq_tags_cpumask(*nvmeq->tags));
}
}
As tags information is not properly configured, setting affinity for IO CQs
is not applied.
It is only applied on Admin CQ.
2. Even with lower version of linux drivers which are able to set affinity
hint properly, we've noticed that '--hintpolicy=exact' does not work on
some linux distros such as debian.
Maybe there are some differences in irqbalance daemon regarding how to
handle hint information even with 'exact' policy.
Kindly Regards
Kyungsan Kim
-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf
Of Keith Busch
Sent: Thursday, September 03, 2015 4:08 AM
To: ???
Cc: Linux-nvme at lists.infradead.org
Subject: Re: setting nvme irq per cpu affinity in device driver
On Wed, 2 Sep 2015, ??? wrote:
> Actually, in current status, device driver already tries to set
> affinity_hint for IRQs during Q initialization.
> But in our tests, it does not guarantee the CPU distribution on
> system-wide even with irqbalance daemon working, failing to resolve above
issues.
Are you running irqbalance with '--hintpolicy=exact'? It Seems to do the
right thing over here.
^ permalink raw reply [flat|nested] 11+ messages in thread* setting nvme irq per cpu affinity in device driver
2015-09-03 0:33 ` 김경산
@ 2015-09-03 14:14 ` Keith Busch
0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2015-09-03 14:14 UTC (permalink / raw)
On Wed, 2 Sep 2015, ??? wrote:
> Hello, Keith Busch.
> Thank you for your opinion. Yes, we've already considered the approach.
>
> However, there were two issues regarding the way.
> 1. In our tests, setting affinity hint is not working on 4.2-rc6 driver.
> Please find below code. It is current implementation (4.2-rc6)for setting
> affinity hint.
>
> static void nvme_set_irq_hints(struct nvme_dev *dev) {
> struct nvme_queue *nvmeq;
> int i;
>
> for (i = 0; i < dev->online_queues; i++) {
> nvmeq = dev->queues[i];
>
> if (!nvmeq->tags || !(*nvmeq->tags))
> continue;
>
> /*kyungsan : <- not reach here from second loop*/
>
> irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
> blk_mq_tags_cpumask(*nvmeq->tags));
> }
> }
>
> As tags information is not properly configured, setting affinity for IO CQs
> is not applied.
> It is only applied on Admin CQ.
Ok, that's a bug. We can't set the hints before the nvmeq's tagset hw
context initialized. That doesn't happen until after a request queue is
allocated, which happens aynchronously now, but it's an easy fix.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-09-10 10:25 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-02 10:26 setting nvme irq per cpu affinity in device driver 김경산
2015-09-02 14:05 ` Christoph Hellwig
2015-09-03 5:01 ` 김경산
2015-09-06 8:06 ` 김경산
2015-09-07 17:54 ` 'Christoph Hellwig'
2015-09-10 10:25 ` 김경산
2015-09-08 14:47 ` Keith Busch
2015-09-09 0:35 ` 김경산
2015-09-02 19:07 ` Keith Busch
2015-09-03 0:33 ` 김경산
2015-09-03 14:14 ` Keith Busch
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox