* NVMe scalability issue
@ 2015-06-01 22:52 Ming Lin
  2015-06-01 23:02 ` Keith Busch
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ming Lin @ 2015-06-01 22:52 UTC (permalink / raw)


Hi list,

I'm playing with 8 high-performance NVMe devices on a 4-socket server.
Each device can deliver about 730K 4k random-read IOPS.

Kernel: 4.1-rc3
An fio test shows that throughput doesn't scale well with 4 or more devices.
I wonder if there is any direction to improve it.

devices		theory		actual
		IOPS(K)		IOPS(K)
-------		-------		-------
1		733		733
2		1466		1446.8
3		2199		2174.5
4		2932		2354.9
5		3665		3024.5
6		4398		3818.9
7		5131		4526.3
8		5864		4621.2
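
To put a number on the gap, the table above can be reduced to a scaling efficiency (actual / theory) per device count. This is only a minimal Python sketch using the figures posted above; the names RESULTS and scaling_efficiency are illustrative and not part of any tool used in the test.

```python
# Figures copied from the table above: devices -> (theory, actual), in K IOPS.
RESULTS = {
    1: (733, 733.0),
    2: (1466, 1446.8),
    3: (2199, 2174.5),
    4: (2932, 2354.9),
    5: (3665, 3024.5),
    6: (4398, 3818.9),
    7: (5131, 4526.3),
    8: (5864, 4621.2),
}


def scaling_efficiency(results):
    """Return {devices: actual / theory} for each row of the table."""
    return {n: actual / theory for n, (theory, actual) in results.items()}


if __name__ == "__main__":
    for n, eff in sorted(scaling_efficiency(RESULTS).items()):
        print(f"{n} devices: {eff:6.1%} of linear")
```

Read this way, efficiency stays near 99% up to 3 devices, falls to roughly 80% at 4, partly recovers at 5 to 7, and is lowest (about 79%) at 8.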

And a graph here:
http://minggr.net/pub/20150601/nvme-scalability.jpg


With 8 devices, CPU is still 43% idle, so CPU is not the bottleneck.

"top" data

Tasks: 565 total,  30 running, 535 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.5 us, 39.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  52833033+total,  3103032 used, 52522732+free,    18472 buffers
KiB Swap:  7999484 total,        0 used,  7999484 free.  1506732 cached Mem

"perf top" data

   PerfTop:  124581 irqs/sec  kernel:78.6%  exact:  0.0% [4000Hz cycles],  (all, 48 CPUs)
-----------------------------------------------------------------------------------------

     3.30%  [kernel]       [k] do_blockdev_direct_IO      
     2.99%  fio            [.] get_io_u                   
     2.79%  fio            [.] axmap_isset                
     2.40%  [kernel]       [k] irq_entries_start          
     1.91%  [kernel]       [k] _raw_spin_lock             
     1.77%  [kernel]       [k] nvme_process_cq            
     1.73%  [kernel]       [k] _raw_spin_lock_irqsave     
     1.71%  fio            [.] fio_gettime                
     1.33%  [kernel]       [k] blk_account_io_start       
     1.24%  [kernel]       [k] blk_account_io_done        
     1.23%  [kernel]       [k] kmem_cache_alloc           
     1.23%  [kernel]       [k] nvme_queue_rq              
     1.22%  fio            [.] io_u_queued_complete       
     1.14%  [kernel]       [k] native_read_tsc            
     1.11%  [kernel]       [k] kmem_cache_free            
     1.05%  [kernel]       [k] __acct_update_integrals    
     1.01%  [kernel]       [k] context_tracking_exit      
     0.94%  [kernel]       [k] _raw_spin_unlock_irqrestore
     0.91%  [kernel]       [k] rcu_eqs_enter_common       
     0.86%  [kernel]       [k] cpuacct_account_field      
     0.84%  fio            [.] td_io_queue  

fio script

[global]
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=64
time_based
runtime=60
group_reporting
numjobs=4

[job0]
filename=/dev/nvme0n1

[job1]
filename=/dev/nvme1n1

[job2]
filename=/dev/nvme2n1

[job3]
filename=/dev/nvme3n1

[job4]
filename=/dev/nvme4n1

[job5]
filename=/dev/nvme5n1

[job6]
filename=/dev/nvme6n1

[job7]
filename=/dev/nvme7n1


* NVMe scalability issue
  2015-06-01 22:52 NVMe scalability issue Ming Lin
@ 2015-06-01 23:02 ` Keith Busch
  2015-06-01 23:24   ` Ming Lin
  2015-06-01 23:28   ` Azher Mughal
  2015-06-02  7:58 ` Matias Bjørling
  2015-06-02 19:03 ` Andrey Kuzmin
  2 siblings, 2 replies; 13+ messages in thread
From: Keith Busch @ 2015-06-01 23:02 UTC (permalink / raw)


On Mon, 1 Jun 2015, Ming Lin wrote:
> Hi list,
>
> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
> Each device can get 730K 4k read IOPS.
>
> Kernel: 4.1-rc3
> fio test shows it doesn't scale well with 4 or more devices.
> I wonder any possible direction to improve it.

There was a demo at SC'14 with a heck of a lot more NVMe drives than that,
and performance scaled quite linearly. Are your devices sharing PCI-e lanes?

You could try setting "cpus_allowed" on each job to the CPUs on the
socket local to the nvme device. That should get a measurable improvement,
especially if your IRQs are appropriately affinitized.
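
One possible way to check the IRQ side is to list the NVMe queue interrupts from /proc/interrupts and read each one's allowed CPUs from /proc/irq/<N>/smp_affinity_list. The Python sketch below assumes the queue interrupts carry names starting with "nvme" in the last column; the helper names nvme_irqs and irq_affinity are made up for illustration.

```python
import re
from pathlib import Path


def nvme_irqs(interrupts_text):
    """Map each NVMe interrupt name (last column) to its IRQ number."""
    irqs = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Numbered rows look like: "<irq>: <per-CPU counts...> <chip> <hwirq> <name>"
        if len(fields) < 2 or not re.fullmatch(r"\d+:", fields[0]):
            continue
        name = fields[-1]
        if name.startswith("nvme"):  # ASSUMPTION: naming of the queue interrupts
            irqs[name] = int(fields[0][:-1])
    return irqs


def irq_affinity(irq, proc=Path("/proc")):
    """Return the CPU list an IRQ may be delivered to, e.g. '0,4,8'."""
    return (proc / "irq" / str(irq) / "smp_affinity_list").read_text().strip()


if __name__ == "__main__" and Path("/proc/interrupts").exists():
    text = Path("/proc/interrupts").read_text()
    for name, irq in sorted(nvme_irqs(text).items()):
        print(f"{name}: irq {irq} -> CPUs {irq_affinity(irq)}")
```

If the CPUs printed for a drive's queues are not on the socket local to that drive, completions are handled on a remote node even when the fio jobs themselves are pinned.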


* NVMe scalability issue
  2015-06-01 23:02 ` Keith Busch
@ 2015-06-01 23:24   ` Ming Lin
  2015-06-02  3:30     ` Keith Busch
  2015-06-01 23:28   ` Azher Mughal
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lin @ 2015-06-01 23:24 UTC (permalink / raw)


On Mon, Jun 1, 2015 at 4:02 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, 1 Jun 2015, Ming Lin wrote:
>>
>> Hi list,
>>
>> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
>> Each device can get 730K 4k read IOPS.
>>
>> Kernel: 4.1-rc3
>> fio test shows it doesn't scale well with 4 or more devices.
>> I wonder any possible direction to improve it.
>
>
> There was a demo at SC'14 with a heck of a lot more NVMe drives than that,
> and performance scaled quite linearly. Are your devices sharing PCI-e lanes?

Is there a way to check it via, for example, /sys?

> You could try setting "cpus_allowed" on each job to the CPU's on the
> socket local to the nvme device. That should get a measurable improvement,
> and if your irq's are appropriately affinitized.

How to know which socket is local to which nvme device?

I did a quick test with:
node0: nvme0 and nvme1
node1: nvme2 and nvme3
node2: nvme4 and nvme5
node3: nvme6 and nvme7

[job0]
filename=/dev/nvme0n1
cpus_allowed=0,4,8,12,16,20,24,28,32,36,40,44

[job1]
filename=/dev/nvme1n1
cpus_allowed=0,4,8,12,16,20,24,28,32,36,40,44

[job2]
filename=/dev/nvme2n1
cpus_allowed=1,5,9,13,17,21,25,29,33,37,41,45

[job3]
filename=/dev/nvme3n1
cpus_allowed=1,5,9,13,17,21,25,29,33,37,41,45

[job4]
filename=/dev/nvme4n1
cpus_allowed=2,6,10,14,18,22,26,30,34,38,42,46

[job5]
filename=/dev/nvme5n1
cpus_allowed=2,6,10,14,18,22,26,30,34,38,42,46

[job6]
filename=/dev/nvme6n1
cpus_allowed=3,7,11,15,19,23,27,31,35,39,43,47

[job7]
filename=/dev/nvme7n1
cpus_allowed=3,7,11,15,19,23,27,31,35,39,43,47

But it doesn't make much difference.

devices         theory          actual          actual IOPS(K)
                IOPS(K)         IOPS(K)         with "cpus_allowed"
-------         -------         -------         -------------------
1               733             733             733
2               1466            1446.8          1467.7
3               2199            2174.5          2213.9
4               2932            2354.9          2354.8
5               3665            3024.5          3085.2
6               4398            3818.9          3822.6
7               5131            4526.3          4517.8
8               5864            4621.2          4722.4


* NVMe scalability issue
  2015-06-01 23:02 ` Keith Busch
  2015-06-01 23:24   ` Ming Lin
@ 2015-06-01 23:28   ` Azher Mughal
  1 sibling, 0 replies; 13+ messages in thread
From: Azher Mughal @ 2015-06-01 23:28 UTC (permalink / raw)


I ran some tests last year before SC using 8 drives in a SuperMicro
server. Please see attached. OS was CentOS 6.5 I think.

-Azher

On 6/1/2015 4:02 PM, Keith Busch wrote:
> On Mon, 1 Jun 2015, Ming Lin wrote:
>> Hi list,
>>
>> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
>> Each device can get 730K 4k read IOPS.
>>
>> Kernel: 4.1-rc3
>> fio test shows it doesn't scale well with 4 or more devices.
>> I wonder any possible direction to improve it.
>
> There was a demo at SC'14 with a heck of a lot more NVMe drives than
> that,
> and performance scaled quite linearly. Are your devices sharing PCI-e
> lanes?
>
> You could try setting "cpus_allowed" on each job to the CPU's on the
> socket local to the nvme device. That should get a measurable
> improvement,
> and if your irq's are appropriately affinitized.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8drives-dd-SC9.PNG
Type: image/png
Size: 83980 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20150601/814c0441/attachment-0001.png>


* NVMe scalability issue
  2015-06-01 23:24   ` Ming Lin
@ 2015-06-02  3:30     ` Keith Busch
  2015-06-02 17:24       ` Ming Lin
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2015-06-02  3:30 UTC (permalink / raw)


On Mon, 1 Jun 2015, Ming Lin wrote:
> On Mon, Jun 1, 2015 at 4:02 PM, Keith Busch <keith.busch@intel.com> wrote:
>> There was a demo at SC'14 with a heck of a lot more NVMe drives than that,
>> and performance scaled quite linearly. Are your devices sharing PCI-e lanes?
>
> Is there a way to check it via, for example, /sys?

   # lspci -tv

>> You could try setting "cpus_allowed" on each job to the CPU's on the
>> socket local to the nvme device. That should get a measurable improvement,
>> and if your irq's are appropriately affinitized.
>
> How to know which socket is local to which nvme device?

   # cat /sys/class/nvme/nvme<#>/device/numa_node
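
To turn that node number into a CPU list usable for "cpus_allowed", one option is to also read /sys/devices/system/node/node<N>/cpulist. A rough Python sketch follows; parse_cpulist and local_cpus are hypothetical helper names, and the controller path is the one given above.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def local_cpus(ctrl, sysfs=Path("/sys")):
    """Return (numa_node, cpus) for a controller name such as 'nvme0'."""
    node = int((sysfs / "class/nvme" / ctrl / "device/numa_node").read_text())
    if node < 0:
        # -1 means the platform did not report a node for this device.
        return node, []
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return node, parse_cpulist(cpulist.read_text())
```

For example, ",".join(map(str, local_cpus("nvme0")[1])) would give a string suitable for the cpus_allowed line of the job that drives nvme0n1.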


* NVMe scalability issue
  2015-06-01 22:52 NVMe scalability issue Ming Lin
  2015-06-01 23:02 ` Keith Busch
@ 2015-06-02  7:58 ` Matias Bjørling
  2015-06-02 19:03 ` Andrey Kuzmin
  2 siblings, 0 replies; 13+ messages in thread
From: Matias Bjørling @ 2015-06-02  7:58 UTC (permalink / raw)


On 06/02/2015 12:52 AM, Ming Lin wrote:
> Hi list,
>
> [global]
> rw=randread
> bs=4k
> direct=1
> ioengine=libaio
> iodepth=64
> time_based
> runtime=60
> group_reporting
> numjobs=4
>
> [job0]
> filename=/dev/nvme0n1
>
> [job1]
> filename=/dev/nvme1n1
>
> [job2]
> filename=/dev/nvme2n1
>
> [job3]
> filename=/dev/nvme3n1
>
> [job4]
> filename=/dev/nvme4n1
>
> [job5]
> filename=/dev/nvme5n1
>
> [job6]
> filename=/dev/nvme6n1
>
> [job7]
> filename=/dev/nvme7n1
>

A wild guess: the jobs might be running on CPUs that are remote from the
device. Try to affinitize each job so it runs on the socket the device is
attached to.


* NVMe scalability issue
  2015-06-02  3:30     ` Keith Busch
@ 2015-06-02 17:24       ` Ming Lin
  2015-06-02 18:22         ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lin @ 2015-06-02 17:24 UTC (permalink / raw)


On Mon, Jun 1, 2015 at 8:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, 1 Jun 2015, Ming Lin wrote:
>>
>> On Mon, Jun 1, 2015 at 4:02 PM, Keith Busch <keith.busch@intel.com> wrote:
>>>
>>> There was a demo at SC'14 with a heck of a lot more NVMe drives than
>>> that,
>>> and performance scaled quite linearly. Are your devices sharing PCI-e
>>> lanes?
>>
>>
>> Is there a way to check it via, for example, /sys?
>
>
>   # lspci -tv

Each group of 4 drives shares one x16 link.

>
>>> You could try setting "cpus_allowed" on each job to the CPU's on the
>>> socket local to the nvme device. That should get a measurable
>>> improvement,
>>> and if your irq's are appropriately affinitized.
>>
>>
>> How to know which socket is local to which nvme device?
>
>
>   # cat /sys/class/nvme/nvme<#>/device/numa_node

# grep . /sys/class/nvme/nvme*/device/numa_node
/sys/class/nvme/nvme0/device/numa_node:1
/sys/class/nvme/nvme1/device/numa_node:1
/sys/class/nvme/nvme2/device/numa_node:1
/sys/class/nvme/nvme3/device/numa_node:1
/sys/class/nvme/nvme4/device/numa_node:2
/sys/class/nvme/nvme5/device/numa_node:2
/sys/class/nvme/nvme6/device/numa_node:2
/sys/class/nvme/nvme7/device/numa_node:2

With the correct numa_node binding, I can now get 5010K IOPS with 8 drives.
That's better, but still short of the linear-scaling figure of 5864K.
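
The corrected job file was not posted, so the following is only a guess at its shape: a Python sketch that renders one job section per drive from the numa_node values above. The node-to-CPU mapping (node n owns CPUs n, n+4, ...) is an assumption carried over from the cpus_allowed lists in the earlier quick test, and node_cpus and job_sections are illustrative names.

```python
# numa_node values as reported by the grep output above.
NUMA_NODE = {
    "nvme0": 1, "nvme1": 1, "nvme2": 1, "nvme3": 1,
    "nvme4": 2, "nvme5": 2, "nvme6": 2, "nvme7": 2,
}

NUM_CPUS = 48   # from the "perf top" header earlier in the thread
NUM_NODES = 4   # 4-socket server


def node_cpus(node):
    """ASSUMPTION: CPUs are interleaved across nodes (node n owns n, n+4, ...)."""
    return list(range(node, NUM_CPUS, NUM_NODES))


def job_sections(numa_node):
    """Render one fio job section per controller, pinned to its local CPUs."""
    lines = []
    for i, ctrl in enumerate(sorted(numa_node)):
        cpus = ",".join(str(c) for c in node_cpus(numa_node[ctrl]))
        lines += [f"[job{i}]", f"filename=/dev/{ctrl}n1",
                  f"cpus_allowed={cpus}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(job_sections(NUMA_NODE))
```

Under that assumed numbering, only nodes 1 and 2 (24 of the 48 CPUs) are local to any drive, so all 8 drives are served by half of the machine.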

I'll check if irq's are appropriately affinitized.

Thanks.


* NVMe scalability issue
  2015-06-02 17:24       ` Ming Lin
@ 2015-06-02 18:22         ` Jens Axboe
  2015-06-02 20:55           ` Ming Lin
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2015-06-02 18:22 UTC (permalink / raw)


On 06/02/2015 11:24 AM, Ming Lin wrote:
> On Mon, Jun 1, 2015 at 8:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>> On Mon, 1 Jun 2015, Ming Lin wrote:
>>>
>>> On Mon, Jun 1, 2015 at 4:02 PM, Keith Busch <keith.busch@intel.com> wrote:
>>>>
>>>> There was a demo at SC'14 with a heck of a lot more NVMe drives than
>>>> that,
>>>> and performance scaled quite linearly. Are your devices sharing PCI-e
>>>> lanes?
>>>
>>>
>>> Is there a way to check it via, for example, /sys?
>>
>>
>>    # lspci -tv
>
> Each 4 drives share x16 lane.
>
>>
>>>> You could try setting "cpus_allowed" on each job to the CPU's on the
>>>> socket local to the nvme device. That should get a measurable
>>>> improvement,
>>>> and if your irq's are appropriately affinitized.
>>>
>>>
>>> How to know which socket is local to which nvme device?
>>
>>
>>    # cat /sys/class/nvme/nvme<#>/device/numa_node
>
> # grep . /sys/class/nvme/nvme*/device/numa_node
> /sys/class/nvme/nvme0/device/numa_node:1
> /sys/class/nvme/nvme1/device/numa_node:1
> /sys/class/nvme/nvme2/device/numa_node:1
> /sys/class/nvme/nvme3/device/numa_node:1
> /sys/class/nvme/nvme4/device/numa_node:2
> /sys/class/nvme/nvme5/device/numa_node:2
> /sys/class/nvme/nvme6/device/numa_node:2
> /sys/class/nvme/nvme7/device/numa_node:2
>
> With correct numa_node binding, now I can get 5010K IOPS with 8 drives.
> It's better now, but still not linear scaled to 5864K
>
> I'll check if irq's are appropriately affinitized.

Just a thought, but one thing that fio is pretty intensive on is time 
keeping. Depending on the platform, there's some shared state between 
the fio IO threads. Does the picture change if you add gtod_reduce=1?
In general, I'd also turn off strict random tracking. Either add 
'norandommap' as an option, or use random_generator=lfsr instead.
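
As a sketch of how the [global] section might look with these suggestions applied, the Python snippet below renders the original options plus two additions. In fio, gtod_reduce=1 is the setting that cuts down on gettimeofday() calls (at the cost of some latency and bandwidth statistics). The names BASE_GLOBAL, TUNING and render_global are illustrative only.

```python
# Options from the original job file, in order; None marks a bare flag.
BASE_GLOBAL = [
    ("rw", "randread"), ("bs", "4k"), ("direct", "1"),
    ("ioengine", "libaio"), ("iodepth", "64"), ("time_based", None),
    ("runtime", "60"), ("group_reporting", None), ("numjobs", "4"),
]

# Additions along the lines suggested in this message.
TUNING = [("gtod_reduce", "1"), ("norandommap", None)]


def render_global(options):
    """Render a list of (key, value) pairs as a fio [global] section."""
    lines = ["[global]"]
    for key, value in options:
        lines.append(key if value is None else f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_global(BASE_GLOBAL + TUNING), end="")
```

Swapping the ("norandommap", None) entry for ("random_generator", "lfsr") would give the second variant mentioned above.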

-- 
Jens Axboe


* NVMe scalability issue
  2015-06-01 22:52 NVMe scalability issue Ming Lin
  2015-06-01 23:02 ` Keith Busch
  2015-06-02  7:58 ` Matias Bjørling
@ 2015-06-02 19:03 ` Andrey Kuzmin
  2015-06-02 19:09   ` Jens Axboe
  2 siblings, 1 reply; 13+ messages in thread
From: Andrey Kuzmin @ 2015-06-02 19:03 UTC (permalink / raw)


On Tue, Jun 2, 2015 at 1:52 AM, Ming Lin <mlin@kernel.org> wrote:
> Hi list,
>
> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
> Each device can get 730K 4k read IOPS.
>
> Kernel: 4.1-rc3
> fio test shows it doesn't scale well with 4 or more devices.
> I wonder any possible direction to improve it.
>
> devices         theory          actual
>                 IOPS(K)         IOPS(K)
> -------         -------         -------
> 1               733             733
> 2               1466            1446.8
> 3               2199            2174.5
> 4               2932            2354.9
> 5               3665            3024.5
> 6               4398            3818.9
> 7               5131            4526.3
> 8               5864            4621.2
>
> And a graph here:
> http://minggr.net/pub/20150601/nvme-scalability.jpg
>
>
> With 8 devices, CPU is still 43% idle, so CPU is not the bottleneck.
>
> "top" data
>
> Tasks: 565 total,  30 running, 535 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 17.5 us, 39.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  52833033+total,  3103032 used, 52522732+free,    18472 buffers
> KiB Swap:  7999484 total,        0 used,  7999484 free.  1506732 cached Mem
>
> "perf top" data
>
>    PerfTop:  124581 irqs/sec  kernel:78.6%  exact:  0.0% [4000Hz cycles],  (all, 48 CPUs)
> -----------------------------------------------------------------------------------------
>
>      3.30%  [kernel]       [k] do_blockdev_direct_IO
>      2.99%  fio            [.] get_io_u
>      2.79%  fio            [.] axmap_isset

Just a thought as well, but axmap_isset cpu usage is suspiciously
high, given a read-only workload where it's essentially a noop.

Regards,
Andrey

>      2.40%  [kernel]       [k] irq_entries_start
>      1.91%  [kernel]       [k] _raw_spin_lock
>      1.77%  [kernel]       [k] nvme_process_cq
>      1.73%  [kernel]       [k] _raw_spin_lock_irqsave
>      1.71%  fio            [.] fio_gettime
>      1.33%  [kernel]       [k] blk_account_io_start
>      1.24%  [kernel]       [k] blk_account_io_done
>      1.23%  [kernel]       [k] kmem_cache_alloc
>      1.23%  [kernel]       [k] nvme_queue_rq
>      1.22%  fio            [.] io_u_queued_complete
>      1.14%  [kernel]       [k] native_read_tsc
>      1.11%  [kernel]       [k] kmem_cache_free
>      1.05%  [kernel]       [k] __acct_update_integrals
>      1.01%  [kernel]       [k] context_tracking_exit
>      0.94%  [kernel]       [k] _raw_spin_unlock_irqrestore
>      0.91%  [kernel]       [k] rcu_eqs_enter_common
>      0.86%  [kernel]       [k] cpuacct_account_field
>      0.84%  fio            [.] td_io_queue
>
> fio script
>
> [global]
> rw=randread
> bs=4k
> direct=1
> ioengine=libaio
> iodepth=64
> time_based
> runtime=60
> group_reporting
> numjobs=4
>
> [job0]
> filename=/dev/nvme0n1
>
> [job1]
> filename=/dev/nvme1n1
>
> [job2]
> filename=/dev/nvme2n1
>
> [job3]
> filename=/dev/nvme3n1
>
> [job4]
> filename=/dev/nvme4n1
>
> [job5]
> filename=/dev/nvme5n1
>
> [job6]
> filename=/dev/nvme6n1
>
> [job7]
> filename=/dev/nvme7n1
>

* NVMe scalability issue
  2015-06-02 19:03 ` Andrey Kuzmin
@ 2015-06-02 19:09   ` Jens Axboe
  2015-06-02 19:11     ` Andrey Kuzmin
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2015-06-02 19:09 UTC (permalink / raw)


On 06/02/2015 01:03 PM, Andrey Kuzmin wrote:
> On Tue, Jun 2, 2015 at 1:52 AM, Ming Lin <mlin@kernel.org> wrote:
>> Hi list,
>>
>> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
>> Each device can get 730K 4k read IOPS.
>>
>> Kernel: 4.1-rc3
>> fio test shows it doesn't scale well with 4 or more devices.
>> I wonder any possible direction to improve it.
>>
>> devices         theory          actual
>>                  IOPS(K)         IOPS(K)
>> -------         -------         -------
>> 1               733             733
>> 2               1466            1446.8
>> 3               2199            2174.5
>> 4               2932            2354.9
>> 5               3665            3024.5
>> 6               4398            3818.9
>> 7               5131            4526.3
>> 8               5864            4621.2
>>
>> And a graph here:
>> http://minggr.net/pub/20150601/nvme-scalability.jpg
>>
>>
>> With 8 devices, CPU is still 43% idle, so CPU is not the bottleneck.
>>
>> "top" data
>>
>> Tasks: 565 total,  30 running, 535 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 17.5 us, 39.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>> KiB Mem:  52833033+total,  3103032 used, 52522732+free,    18472 buffers
>> KiB Swap:  7999484 total,        0 used,  7999484 free.  1506732 cached Mem
>>
>> "perf top" data
>>
>>     PerfTop:  124581 irqs/sec  kernel:78.6%  exact:  0.0% [4000Hz cycles],  (all, 48 CPUs)
>> -----------------------------------------------------------------------------------------
>>
>>       3.30%  [kernel]       [k] do_blockdev_direct_IO
>>       2.99%  fio            [.] get_io_u
>>       2.79%  fio            [.] axmap_isset
>
> Just a thought as well, but axmap_isset cpu usage is suspiciously
> high, given a read-only workload where it's essentially a noop.

Read or write doesn't matter, it's still marked in the random map. Both 
of them will maintain that state.
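
A toy illustration of the idea, not fio's actual axmap code: with a random map every generated offset costs a lookup and an update, whether the I/O is a read or a write, while the map-free variant just draws a number and may repeat blocks. The function names below are made up for this Python sketch.

```python
import random


def offsets_with_map(num_blocks, seed=0):
    """Yield every block exactly once, in random order, tracking hits in a map."""
    rng = random.Random(seed)
    seen = bytearray(num_blocks)       # one flag per block: the "random map"
    for _ in range(num_blocks):
        block = rng.randrange(num_blocks)
        while seen[block]:             # lookup happens for reads and writes alike
            block = (block + 1) % num_blocks
        seen[block] = 1                # mark the block as covered
        yield block


def offsets_without_map(num_blocks, count, seed=0):
    """Yield random blocks with no tracking; repeats are possible."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.randrange(num_blocks)


if __name__ == "__main__":
    print(list(offsets_with_map(8)))
    print(list(offsets_without_map(8, 8)))
```

The first generator covers each block exactly once; the second does no bookkeeping at all, which is the trade-off being discussed here.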

-- 
Jens Axboe


* NVMe scalability issue
  2015-06-02 19:09   ` Jens Axboe
@ 2015-06-02 19:11     ` Andrey Kuzmin
  2015-06-02 19:14       ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Andrey Kuzmin @ 2015-06-02 19:11 UTC (permalink / raw)


On Tue, Jun 2, 2015 at 10:09 PM, Jens Axboe <axboe@fb.com> wrote:
> On 06/02/2015 01:03 PM, Andrey Kuzmin wrote:
>>
>> On Tue, Jun 2, 2015 at 1:52 AM, Ming Lin <mlin@kernel.org> wrote:
>>>
>>> Hi list,
>>>
>>> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
>>> Each device can get 730K 4k read IOPS.
>>>
>>> Kernel: 4.1-rc3
>>> fio test shows it doesn't scale well with 4 or more devices.
>>> I wonder any possible direction to improve it.
>>>
>>> devices         theory          actual
>>>                  IOPS(K)         IOPS(K)
>>> -------         -------         -------
>>> 1               733             733
>>> 2               1466            1446.8
>>> 3               2199            2174.5
>>> 4               2932            2354.9
>>> 5               3665            3024.5
>>> 6               4398            3818.9
>>> 7               5131            4526.3
>>> 8               5864            4621.2
>>>
>>> And a graph here:
>>> http://minggr.net/pub/20150601/nvme-scalability.jpg
>>>
>>>
>>> With 8 devices, CPU is still 43% idle, so CPU is not the bottleneck.
>>>
>>> "top" data
>>>
>>> Tasks: 565 total,  30 running, 535 sleeping,   0 stopped,   0 zombie
>>> %Cpu(s): 17.5 us, 39.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,
>>> 0.0 st
>>> KiB Mem:  52833033+total,  3103032 used, 52522732+free,    18472 buffers
>>> KiB Swap:  7999484 total,        0 used,  7999484 free.  1506732 cached
>>> Mem
>>>
>>> "perf top" data
>>>
>>>     PerfTop:  124581 irqs/sec  kernel:78.6%  exact:  0.0% [4000Hz
>>> cycles],  (all, 48 CPUs)
>>>
>>> -----------------------------------------------------------------------------------------
>>>
>>>       3.30%  [kernel]       [k] do_blockdev_direct_IO
>>>       2.99%  fio            [.] get_io_u
>>>       2.79%  fio            [.] axmap_isset
>>
>>
>> Just a thought as well, but axmap_isset cpu usage is suspiciously
>> high, given a read-only workload where it's essentially a noop.
>
>
> Read or write doesn't matter, it's still marked in the random map. Both of
> them will maintain that state.
>

Not sure keeping track of blocks read was the intention in the test,
so it's worth rerunning with norandommap=1.


Regards,
Andrey

> --
> Jens Axboe
>


* NVMe scalability issue
  2015-06-02 19:11     ` Andrey Kuzmin
@ 2015-06-02 19:14       ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2015-06-02 19:14 UTC (permalink / raw)


On 06/02/2015 01:11 PM, Andrey Kuzmin wrote:
> On Tue, Jun 2, 2015 at 10:09 PM, Jens Axboe <axboe@fb.com> wrote:
>> On 06/02/2015 01:03 PM, Andrey Kuzmin wrote:
>>>
>>> On Tue, Jun 2, 2015 at 1:52 AM, Ming Lin <mlin@kernel.org> wrote:
>>>>
>>>> Hi list,
>>>>
>>>> I'm playing with 8 high performance NVMe devices on a 4 sockets server.
>>>> Each device can get 730K 4k read IOPS.
>>>>
>>>> Kernel: 4.1-rc3
>>>> fio test shows it doesn't scale well with 4 or more devices.
>>>> I wonder any possible direction to improve it.
>>>>
>>>> devices         theory          actual
>>>>                   IOPS(K)         IOPS(K)
>>>> -------         -------         -------
>>>> 1               733             733
>>>> 2               1466            1446.8
>>>> 3               2199            2174.5
>>>> 4               2932            2354.9
>>>> 5               3665            3024.5
>>>> 6               4398            3818.9
>>>> 7               5131            4526.3
>>>> 8               5864            4621.2
>>>>
>>>> And a graph here:
>>>> http://minggr.net/pub/20150601/nvme-scalability.jpg
>>>>
>>>>
>>>> With 8 devices, CPU is still 43% idle, so CPU is not the bottleneck.
>>>>
>>>> "top" data
>>>>
>>>> Tasks: 565 total,  30 running, 535 sleeping,   0 stopped,   0 zombie
>>>> %Cpu(s): 17.5 us, 39.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  0.0 si,
>>>> 0.0 st
>>>> KiB Mem:  52833033+total,  3103032 used, 52522732+free,    18472 buffers
>>>> KiB Swap:  7999484 total,        0 used,  7999484 free.  1506732 cached
>>>> Mem
>>>>
>>>> "perf top" data
>>>>
>>>>      PerfTop:  124581 irqs/sec  kernel:78.6%  exact:  0.0% [4000Hz
>>>> cycles],  (all, 48 CPUs)
>>>>
>>>> -----------------------------------------------------------------------------------------
>>>>
>>>>        3.30%  [kernel]       [k] do_blockdev_direct_IO
>>>>        2.99%  fio            [.] get_io_u
>>>>        2.79%  fio            [.] axmap_isset
>>>
>>>
>>> Just a thought as well, but axmap_isset cpu usage is suspiciously
>>> high, given a read-only workload where it's essentially a noop.
>>
>>
>> Read or write doesn't matter, it's still marked in the random map. Both of
>> them will maintain that state.
>>
>
> Not sure keeping track of blocks read was the intention in the test,
> so it's worth rerunning with norandommap=1.

Right, it doesn't matter for this test. But it's only a few percent of 
CPU, and should not impact scaling. I suspect the time keeping would be 
a bigger offender.

-- 
Jens Axboe


* NVMe scalability issue
  2015-06-02 18:22         ` Jens Axboe
@ 2015-06-02 20:55           ` Ming Lin
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lin @ 2015-06-02 20:55 UTC (permalink / raw)


On Tue, Jun 2, 2015 at 11:22 AM, Jens Axboe <axboe@fb.com> wrote:
> On 06/02/2015 11:24 AM, Ming Lin wrote:
>>
>> On Mon, Jun 1, 2015 at 8:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>>>
>>> On Mon, 1 Jun 2015, Ming Lin wrote:
>>>>
>>>>
>>>> On Mon, Jun 1, 2015 at 4:02 PM, Keith Busch <keith.busch at intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> There was a demo at SC'14 with a heck of a lot more NVMe drives than
>>>>> that,
>>>>> and performance scaled quite linearly. Are your devices sharing PCI-e
>>>>> lanes?
>>>>
>>>>
>>>>
>>>> Is there a way to check it via, for example, /sys?
>>>
>>>
>>>
>>>    # lspci -tv
>>
>>
>> Each 4 drives share x16 lane.
>>
>>>
>>>>> You could try setting "cpus_allowed" on each job to the CPU's on the
>>>>> socket local to the nvme device. That should get a measurable
>>>>> improvement,
>>>>> and if your irq's are appropriately affinitized.
>>>>
>>>>
>>>>
>>>> How to know which socket is local to which nvme device?
>>>
>>>
>>>
>>>    # cat /sys/class/nvme/nvme<#>/device/numa_node
>>
>>
>> # grep . /sys/class/nvme/nvme*/device/numa_node
>> /sys/class/nvme/nvme0/device/numa_node:1
>> /sys/class/nvme/nvme1/device/numa_node:1
>> /sys/class/nvme/nvme2/device/numa_node:1
>> /sys/class/nvme/nvme3/device/numa_node:1
>> /sys/class/nvme/nvme4/device/numa_node:2
>> /sys/class/nvme/nvme5/device/numa_node:2
>> /sys/class/nvme/nvme6/device/numa_node:2
>> /sys/class/nvme/nvme7/device/numa_node:2
>>
>> With correct numa_node binding, now I can get 5010K IOPS with 8 drives.
>> It's better now, but still not linear scaled to 5864K
>>
>> I'll check if irq's are appropriately affinitized.
>
>
> Just a thought, but one thing that fio is pretty intensive on is time
> keeping. Depending on the platform, there's some shared state between the
> fio IO threads. Does the picture change if you add gtod_reduce=1?
> In general, I'd also turn off strict random tracking. Either add
> 'norandommap' as an option, or use random_generator=lfsr instead.

I'll try it once the server is free.

It's a server with 4 NUMA nodes and 8 NVMe drives.
With the current installation, each group of 4 drives is local to one node
and shares one PCIe 3.0 x16 link.
I'll re-install them so that each pair of drives is local to one node and
shares one x16 link.

That will probably also help.
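
A back-of-the-envelope check of why this may matter, as a Python sketch. It assumes the drives behind one x16 link all run at the single-drive rate of 733K 4k IOPS, and it uses the raw PCIe 3.0 rate (8 GT/s per lane, 128b/130b encoding) without any packet overhead, so real headroom is smaller than shown. The constant and function names are illustrative.

```python
BLOCK_BYTES = 4096          # bs=4k from the job file
IOPS_PER_DRIVE = 733_000    # single-drive result from the first message

# PCIe 3.0 signalling: 8 GT/s per lane with 128b/130b encoding.
LANE_BYTES_PER_SEC = 8e9 * (128 / 130) / 8
LANES = 16


def link_utilisation(drives):
    """Fraction of raw x16 bandwidth needed by `drives` at full 4k read rate."""
    demand = drives * IOPS_PER_DRIVE * BLOCK_BYTES
    return demand / (LANE_BYTES_PER_SEC * LANES)


if __name__ == "__main__":
    for drives in (2, 4):
        print(f"{drives} drives: {link_utilisation(drives):.0%} of raw x16 bandwidth")
```

With 4 drives the data alone needs roughly three quarters of the raw link rate, versus well under half with 2 drives per link.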

>
> --
> Jens Axboe
>
