* bad IOPS when running multiple btest/fio in parallel
@ 2018-10-10 21:52 Yao Lin
2018-10-15 7:55 ` Sagi Grimberg
0 siblings, 1 reply; 8+ messages in thread
From: Yao Lin @ 2018-10-10 21:52 UTC (permalink / raw)
Host: Ubuntu 18.04 (4.15 kernel), i9-7940X (14C/28T) with 32G DRAM and a single-port 100G rNIC. No OFED driver is installed.
1. When I insert 4 Intel Optane 905P drives into that host and run 4 btest instances in parallel (one btest per Optane, random read, bs=4K, 6 threads, qd=32), I get an aggregate of 2380K IOPS.
2. Then I move those 4 Optane drives into 4 NVMeOF targets (RoCEv2). Each target has a 25G rNIC. All four 25G rNICs and the 100G rNIC are connected to a switch.
3. Starting iperf from all 4 targets toward the host gives an aggregate throughput of 92Gbps, so the data path between the host and the targets is clean.
4. From the host, use "nvme connect" to link up with all 4 targets.
5. Run non-overlapping btest runs against each target: IOPS is around 595K each. So this is good.
6. Run 4 btest instances in parallel (one per target). This is basically the same as #1, except it's now over the fabric. But the aggregate is only 1500K IOPS. Assigning CPU affinity so that each btest gets an exclusive 3C/6T doesn't help. Replacing btest with fio doesn't help either.
7. Replace that 100G rNIC with a model from a different vendor and repeat test #6. The aggregate IOPS is better, but still nowhere close to the expected 2380K.
So I am wondering whether the Linux inbox NVMeOF driver has any known limitation when running multiple sessions in parallel. Is there any tuning I can try?
Thanks,
Yao
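A quick sanity check of the numbers in this message, as a minimal sketch. It assumes "bs=4K" means 4096-byte reads and "K IOPS" means thousands, and it counts payload only (no RoCEv2 or NVMeOF header overhead). The function name is illustrative.

```python
def iops_to_gbps(iops: float, block_bytes: int = 4096) -> float:
    """Payload bandwidth in Gb/s (decimal gigabits) for a given IOPS rate."""
    return iops * block_bytes * 8 / 1e9


if __name__ == "__main__":
    # Figures taken from the test steps above (values in thousands of IOPS).
    cases = [
        ("local, 4 drives", 2380),
        ("fabric, one target", 595),
        ("fabric, 4 targets", 1500),
    ]
    for label, kiops in cases:
        print(f"{label:20s} {kiops:5d}K IOPS -> {iops_to_gbps(kiops * 1000):5.1f} Gb/s")
```

Under these assumptions the local 2380K IOPS corresponds to roughly 78 Gb/s of payload, which is below the 92 Gb/s measured with iperf, and 595K IOPS per target is roughly 19.5 Gb/s, which fits within a 25G link. So the expected aggregate does not obviously exceed the raw link capacity.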
* bad IOPS when running multiple btest/fio in parallel
@ 2018-10-12 4:44 Yao Lin
2018-10-12 14:39 ` Keith Busch
2018-10-12 15:49 ` Bart Van Assche
0 siblings, 2 replies; 8+ messages in thread
From: Yao Lin @ 2018-10-12 4:44 UTC (permalink / raw)
Today I changed to a much simpler setup and the same issue persists.
Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs. Create a null block device on the target PC and configure it as the NVMeOF target. So there is no switch or SSD in this setup. And this is a single fio, not the 4 fio instances in parallel I mentioned earlier.
When I run fio against that null block device from the host, the best IOPS is 1550K. That's the best result after trying many different QD, number-of-jobs, and CPU affinity settings. Running the same fio test on the target itself, I get 2250K IOPS (it jumps to 3650K when I increase the number of threads).
So it seems to me that the Linux NVMe stack is quite good and can support 100Gb/s+ throughput. But the same cannot be said of the NVMeOF stack. Is any tuning possible?
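The "many different QD, number of jobs" search described above can be made repeatable with a small generator. This is only a sketch: the device path `/dev/nvme0n1`, the grid values, the 30-second runtime, and the 4K random-read workload (carried over from the first message) are all assumptions, since this message does not state the exact fio options used.

```python
from itertools import product


def fio_cmd(device, iodepth, numjobs, runtime_s=30):
    """Build one fio command line for a 4K random-read run (assumed workload)."""
    return [
        "fio",
        "--name=randread",
        f"--filename={device}",
        "--ioengine=libaio",
        "--direct=1",
        "--rw=randread",
        "--bs=4k",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


def sweep(device, iodepths=(16, 32, 64), numjobs=(4, 8, 14)):
    """Return one command per (iodepth, numjobs) combination."""
    return [fio_cmd(device, qd, nj) for qd, nj in product(iodepths, numjobs)]


if __name__ == "__main__":
    # Hypothetical device node; substitute the NVMeOF namespace on the host.
    for cmd in sweep("/dev/nvme0n1"):
        print(" ".join(cmd))
```

Running the same grid on the host (over the fabric) and on the target (against the null block device directly) gives like-for-like numbers to compare.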
* bad IOPS when running multiple btest/fio in parallel
2018-10-12 4:44 bad IOPS when running multiple btest/fio in parallel Yao Lin
@ 2018-10-12 14:39 ` Keith Busch
2018-10-12 15:37 ` [EXT] " Yao Lin
2018-10-12 15:49 ` Bart Van Assche
1 sibling, 1 reply; 8+ messages in thread
From: Keith Busch @ 2018-10-12 14:39 UTC (permalink / raw)
On Fri, Oct 12, 2018 at 04:44:22AM +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
>
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs. Create a null block device on the target PC and configure it as the NVMeOF target. So there is no switch or SSD in this setup. And this is a single fio, not the 4 fio instances in parallel I mentioned earlier.
>
> When I run fio against that null block device from the host, the best IOPS is 1550K. That's the best result after trying many different QD, number-of-jobs, and CPU affinity settings. Running the same fio test on the target itself, I get 2250K IOPS (it jumps to 3650K when I increase the number of threads).
>
> So it seems to me that the Linux NVMe stack is quite good and can support 100Gb/s+ throughput. But the same cannot be said of the NVMeOF stack. Is any tuning possible?
You're sure it's the software stack? Need to check your CPU utilization to
see if that's a possibility.
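One way to check per-core utilization, including time spent in interrupt and softirq context (which matters for a NIC-heavy workload), is to diff two snapshots of `/proc/stat`. The following is a minimal sketch assuming a reasonably modern Linux `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq, steal); the function names are illustrative.

```python
import os
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-CPU line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and other rows.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + values[4]  # idle + iowait
        total = sum(values)
        result[fields[0]] = (total - idle, total)
    return result


def busy_percent(before_text, after_text):
    """Per-CPU busy percentage between two /proc/stat snapshots."""
    before = parse_proc_stat(before_text)
    after = parse_proc_stat(after_text)
    out = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        dt = total1 - total0
        out[cpu] = 100.0 * (busy1 - busy0) / dt if dt > 0 else 0.0
    return out


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(1)
    with open("/proc/stat") as f:
        second = f.read()
    for cpu, pct in sorted(busy_percent(first, second).items()):
        print(f"{cpu:6s} {pct:5.1f}% busy")
```

Sampling while the fio or btest run is in progress shows whether any single core, for example one that services most of the NIC completion interrupts, is saturated even when the average load looks moderate.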
* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
2018-10-12 14:39 ` Keith Busch
@ 2018-10-12 15:37 ` Yao Lin
0 siblings, 0 replies; 8+ messages in thread
From: Yao Lin @ 2018-10-12 15:37 UTC (permalink / raw)
I monitored the CPU usage during all these tests. I have a powerful CPU (i9-7940X) and none of its cores ever reaches 80% load.
* bad IOPS when running multiple btest/fio in parallel
2018-10-12 4:44 bad IOPS when running multiple btest/fio in parallel Yao Lin
2018-10-12 14:39 ` Keith Busch
@ 2018-10-12 15:49 ` Bart Van Assche
2018-10-12 16:02 ` [EXT] " Yao Lin
1 sibling, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2018-10-12 15:49 UTC (permalink / raw)
On Fri, 2018-10-12 at 04:44 +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
>
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs.
> Create a null block device on the target PC and configure it as the
> NVMeOF target. So there is no switch or SSD in this setup. And this is
> a single fio, not the 4 fio instances in parallel I mentioned earlier.
>
> When I run fio against that null block device from the host, the best
> IOPS is 1550K. That's the best result after trying many different QD,
> number-of-jobs, and CPU affinity settings. Running the same fio test on
> the target itself, I get 2250K IOPS (it jumps to 3650K when I increase
> the number of threads).
>
> So it seems to me that the Linux NVMe stack is quite good and can
> support 100Gb/s+ throughput. But the same cannot be said of the NVMeOF
> stack. Is any tuning possible?
Many high-speed network adapters need multiple connections between
initiator and target to achieve line rate (typically 2-4 connections).
From the NVMeOF initiator driver:
    set->nr_hw_queues = nctrl->queue_count - 1;
I think the "queue_count" parameter can be configured when creating a
connection. From the drivers/nvme/host/fabrics.c source file:
    static const match_table_t opt_tokens = {
        [ ... ]
        { NVMF_OPT_NR_IO_QUEUES, "nr_io_queues=%d" },
        [ ... ]
    };
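In practice this option is usually passed through nvme-cli, which exposes it as `--nr-io-queues`. The sketch below only builds the argument list; the target address and NQN are hypothetical placeholders, and the helper name is illustrative. Since queue_count also counts the admin queue, the `- 1` in the driver line above leaves one hardware queue per I/O queue.

```python
def nvme_connect_argv(traddr, nqn, nr_io_queues, trsvcid="4420", transport="rdma"):
    """Build an `nvme connect` argument list with an explicit I/O queue count."""
    if nr_io_queues < 1:
        raise ValueError("nr_io_queues must be at least 1")
    return [
        "nvme",
        "connect",
        f"--transport={transport}",
        f"--traddr={traddr}",
        f"--trsvcid={trsvcid}",
        f"--nqn={nqn}",
        f"--nr-io-queues={nr_io_queues}",
    ]


if __name__ == "__main__":
    # Hypothetical target address and subsystem NQN, for illustration only.
    argv = nvme_connect_argv("192.168.1.10", "nqn.2018-10.example:nullb0", 8)
    print(" ".join(argv))
    # To actually connect (needs root and a reachable target), one could use:
    #   import subprocess; subprocess.run(argv, check=True)
```

Trying a few values and re-running the benchmark after each reconnect shows whether the queue count is a limiting factor. The driver and the target may grant fewer queues than requested.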
Have you tried changing the nr_io_queues parameter? Have you verified
whether the 100G NICs you are using allocate multiple MSI-X vectors and
whether each vector has been assigned to a different CPU?
Bart.
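The vector-to-CPU question can be checked from `/proc/interrupts`. The following is a minimal sketch that filters rows by a driver name substring and reports which CPUs actually serviced those interrupts. The name `examplenic` and the sample rows are made up for illustration; the real substring depends on the NIC driver in use.

```python
def irq_counts(interrupts_text, name_filter):
    """Return {irq: [count_per_cpu, ...]} for rows whose text contains name_filter."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        if name_filter not in line:
            continue
        parts = line.split()
        table[parts[0].rstrip(":")] = [int(c) for c in parts[1:1 + ncpu]]
    return table


def cpus_servicing(table):
    """Set of CPU indices that handled at least one of the filtered interrupts."""
    return {cpu for counts in table.values() for cpu, n in enumerate(counts) if n > 0}


if __name__ == "__main__":
    # Made-up sample in the /proc/interrupts layout; on a real host, read the file.
    sample = (
        "           CPU0       CPU1       CPU2       CPU3\n"
        " 45:       1000          0          0          0  PCI-MSI 524288-edge  examplenic_comp0\n"
        " 46:          0       2000          0          0  PCI-MSI 524289-edge  examplenic_comp1\n"
        "NMI:          3          3          3          3  Non-maskable interrupts\n"
    )
    table = irq_counts(sample, "examplenic")
    print(table)
    print(sorted(cpus_servicing(table)))
```

If many vectors exist but only a few CPUs show non-zero counts during a run, the completion work is concentrated on those cores.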
* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
2018-10-12 15:49 ` Bart Van Assche
@ 2018-10-12 16:02 ` Yao Lin
2018-10-15 7:50 ` Sagi Grimberg
0 siblings, 1 reply; 8+ messages in thread
From: Yao Lin @ 2018-10-12 16:02 UTC (permalink / raw)
Thanks Bart. In my original post, I listed the performance from 2 different 100G NICs. I worked with an engineer for the NIC that performs better. Their driver does support a large number of IRQs, which are assigned to all 28 CPUs in a round-robin manner. But even with this design, that NIC can hit only 76Gb/s for RoCEv2 traffic.
I haven't gotten a response from the other NIC vendor. Their RoCEv2 throughput has never exceeded 55Gb/s. I will take a look at the source code.
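To relate these RoCEv2 throughput figures to the IOPS targets, the ceiling each one implies can be worked out directly. This is a rough sketch that assumes 4096-byte reads (as in the first message) and ignores protocol headers, so real ceilings would be somewhat lower. The function name is illustrative.

```python
def max_iops(gbps: float, block_bytes: int = 4096) -> float:
    """Upper bound on IOPS if the link carried nothing but block payload."""
    return gbps * 1e9 / 8 / block_bytes


if __name__ == "__main__":
    # Throughput figures quoted in this thread, in Gb/s.
    for gbps in (55, 76, 92):
        print(f"{gbps:3d} Gb/s -> at most {max_iops(gbps) / 1000:7.0f}K IOPS at 4 KiB")
```

Under these assumptions 76 Gb/s caps out at about 2320K IOPS and 55 Gb/s at about 1680K IOPS, so neither NIC could reach the 2380K seen locally unless its RoCEv2 throughput improves.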
* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
2018-10-12 16:02 ` [EXT] " Yao Lin
@ 2018-10-15 7:50 ` Sagi Grimberg
0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2018-10-15 7:50 UTC (permalink / raw)
> Thanks Bart. In my original post, I listed the performance from 2 different 100G NICs. I worked with an engineer for the NIC that performs better. Their driver does support a large number of IRQs, which are assigned to all 28 CPUs in a round-robin manner. But even with this design, that NIC can hit only 76Gb/s for RoCEv2 traffic.
>
> I haven't gotten a response from the other NIC vendor. Their RoCEv2 throughput has never exceeded 55Gb/s. I will take a look at the source code.
What kernel version are you running?
Do you happen to run irq balancer?
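Whether an IRQ balancer is moving the NIC vectors around can be observed by reading each vector's `smp_affinity_list` under `/proc/irq/<n>/` before and after a run and comparing the results. The following is a minimal sketch; the IRQ number 45 in the usage part is a made-up example, and the function names are illustrative.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8" into a set of CPU indices."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def irq_affinity(irq, root="/proc/irq"):
    """Return the set of CPUs an IRQ may be delivered to, or None if unreadable."""
    path = os.path.join(root, str(irq), "smp_affinity_list")
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    # Hypothetical IRQ number; take real ones from /proc/interrupts.
    print(irq_affinity(45))
```

If the sets differ between two readings taken a few minutes apart, something (typically the irqbalance daemon) is rewriting the affinities.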
* bad IOPS when running multiple btest/fio in parallel
2018-10-10 21:52 Yao Lin
@ 2018-10-15 7:55 ` Sagi Grimberg
0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2018-10-15 7:55 UTC (permalink / raw)
> Host: Ubuntu 18.04 (4.15 kernel), i9-7940X (14C/28T) with 32G DRAM and a single-port 100G rNIC. No OFED driver is installed.
>
> 1. When I insert 4 Intel Optane 905P drives into that host and run 4 btest instances in parallel (one btest per Optane, random read, bs=4K, 6 threads, qd=32), I get an aggregate of 2380K IOPS.
> 2. Then I move those 4 Optane drives into 4 NVMeOF targets (RoCEv2). Each target has a 25G rNIC. All four 25G rNICs and the 100G rNIC are connected to a switch.
> 3. Starting iperf from all 4 targets toward the host gives an aggregate throughput of 92Gbps, so the data path between the host and the targets is clean.
> 4. From the host, use "nvme connect" to link up with all 4 targets.
> 5. Run non-overlapping btest runs against each target: IOPS is around 595K each. So this is good.
> 6. Run 4 btest instances in parallel (one per target). This is basically the same as #1, except it's now over the fabric. But the aggregate is only 1500K IOPS. Assigning CPU affinity so that each btest gets an exclusive 3C/6T doesn't help. Replacing btest with fio doesn't help either.
> 7. Replace that 100G rNIC with a model from a different vendor and repeat test #6. The aggregate IOPS is better, but still nowhere close to the expected 2380K.
>
> So I am wondering whether the Linux inbox NVMeOF driver has any known limitation when running multiple sessions in parallel. Is there any tuning I can try?
Does setting modparam register_always=Y make a difference?
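Before changing the parameter it helps to see what the loaded module currently uses. The sketch below reads a module parameter from sysfs. It assumes register_always belongs to the nvme_rdma host module and is exposed under `/sys/module/nvme_rdma/parameters/`, which is worth confirming on the kernel in question; the helper name is illustrative.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the current value of a loaded module's parameter, or None if absent."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Boolean module parameters read back as "Y" or "N".
    print(read_module_param("nvme_rdma", "register_always"))
```

If the parameter is read-only at runtime, changing it means reloading the module with the new value (for example `modprobe nvme-rdma register_always=N`) and reconnecting to the targets before re-running the test.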