Fio high IOPS measurement mistake

Flexible I/O Tester development
 help / color / mirror / Atom feed

* Fio high IOPS measurement mistake
@ 2016-03-01  5:17 Vladislav Bolkhovitin
  2016-03-01  6:01 ` Sitsofe Wheeler
  2016-03-02  8:26 ` Andrey Kuzmin
  0 siblings, 2 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-01  5:17 UTC (permalink / raw)
  To: fio

Hello,

I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
and direct=1, so user space CPU consumption should be virtually zero.

That 30% user CPU consumption makes me suspect that this is overhead for internal fio
housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
hope, I'm using correct English terms).

Can anybody comment it and suggest how to decrease this user space CPU consumption?

Here is my full fio job:

[global]
ioengine=libaio
buffered=0
sync=0
direct=1
randrepeat=1
softrandommap=1
rw=randread
bs=4k
filename=./nvram (it's a link to a block device)
exitall=1
thread=1
disable_lat=1
disable_slat=1
disable_clat=1
loops=10
iodepth=16

[file1]

[file2]

I'm working on million+ IOPS range.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-01  5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin
@ 2016-03-01  6:01 ` Sitsofe Wheeler
  2016-03-02  4:25   ` Vladislav Bolkhovitin
  2016-03-02  8:26 ` Andrey Kuzmin
  1 sibling, 1 reply; 22+ messages in thread
From: Sitsofe Wheeler @ 2016-03-01  6:01 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org

Hi,

On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> Hello,
>
> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
> and direct=1, so user space CPU consumption should be virtually zero.
>
> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
> hope, I'm using correct English terms).
>
> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>
> Here is my full fio job:
>
> [global]
> ioengine=libaio
> buffered=0
> sync=0
> direct=1
> randrepeat=1
> softrandommap=1
> rw=randread
> bs=4k
> filename=./nvram (it's a link to a block device)
> exitall=1
> thread=1
> disable_lat=1
> disable_slat=1
> disable_clat=1
> loops=10
> iodepth=16

You appear to be missing gtod_reduce
(https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
gettimeofday cpu pinning. You also aren't using batching
(https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).

You may want to look at what fio settings your flash vendor recommends
for benchmarking purposes...

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-01  6:01 ` Sitsofe Wheeler
@ 2016-03-02  4:25   ` Vladislav Bolkhovitin
  2016-03-02  7:38     ` Sitsofe Wheeler
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-02  4:25 UTC (permalink / raw)
  To: Sitsofe Wheeler, fio@vger.kernel.org

Hi,

Sitsofe Wheeler wrote on 02/29/2016 10:01 PM:
> Hi,
> 
> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>> Hello,
>>
>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
>> and direct=1, so user space CPU consumption should be virtually zero.
>>
>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
>> hope, I'm using correct English terms).
>>
>> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>>
>> Here is my full fio job:
>>
>> [global]
>> ioengine=libaio
>> buffered=0
>> sync=0
>> direct=1
>> randrepeat=1
>> softrandommap=1
>> rw=randread
>> bs=4k
>> filename=./nvram (it's a link to a block device)
>> exitall=1
>> thread=1
>> disable_lat=1
>> disable_slat=1
>> disable_clat=1
>> loops=10
>> iodepth=16
> 
> You appear to be missing gtod_reduce
> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
> gettimeofday cpu pinning. You also aren't using batching
> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).

Thanks, I tried them, but they did not make any significant difference. The biggest
difference I had was when I changed CPU governor to "performance". Now I have 20-25%
user space, measured by fio itself, it's coherent with top. Note, I'm considering
per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU).

I also tried to short circuit the sync engine by calling fio_io_end() directly from top
of fio_syncio_queue(), so no actual IO is done. The results were interesting enough to
publish here in details (%% are per job):

Jobs	IOPS(M)	%user	%sys
1	4.3	78	22
2	7.6	67	33
3	10.5	65	35
4	7.7	61	38
5	4.8	78	22
6	4.7	83	17
7	4.8	84	15

Results were very consistent between runs. CPU - 8 cores Intel Xeon E5-2667 v3 @
3.20GHz with 20M L3 cache and HT. Fio is the latest git.

Obviously, if fio had zero overhead, i.e. instrumental mistake, IOPS level in this test
should sky rocket to hundreds of millions IOPS to have few %% overhead on multi-million
IOPS measurements. But we only have 4.3M per thread and 10.5M overall, which are,
apparently, max fio is capable to measure in the current implementation doesn't matter
how fast the storage stack is (it simply doesn't have more CPU cycles to run).

Also, apparently, there is some lock contention for something inside fio, which is
severely limiting multi-jobs performance.

Interesting that gtod_cpu for the single job case decreased IOPS to 3.8M with the same
user/sys %%: 22/78. Explicit clocksource=cpu didn't make any difference.

Another observation is why the sys CPU consumption is so high, if TSC clock is used?
Apparently, it was not used ever despite of explicit clocksource=cpu.

I checked perf for the single job case, which is the most interesting case where to
start optimizing from, and it reported 69% of time was spent in clock_thread_fn() and
3.2% in memset(). The latter also rises question, why is memset for the READs test?
Apparently, this memset is on high IO path.

The full job file was:

[global]
ioengine=sync
buffered=0
sync=0
direct=1
randrepeat=0
norandommap
softrandommap=1
random_generator=lfsr /* does not really matter */
rw=randread
bs=4K /* it does not matter, since it's short circuit */
filename=./nvram /* does not matter */
exitall=1
thread=1
gtod_reduce=1
loops=10
iodepth=8 /* does not matter */

[file1]

[file2]

...

The consumed user space CPU roughly could be considered the instrumental mistake.
Generally speaking, we have 3 components: load generator, measurement infrastructure
("a gauge") and load processor (storage). The storage is the object whose performance
we are measuring by applying load from the load generator and using the measurement
infrastructure to get the results. Since the storage stack is entirely in the kernel,
what we can see as the user space CPU consumption is the aggregated load generator and
measurement infrastructure CPU consumption, i.e. fio overhead, i.e. instrumental
mistake. (Obviously, this is true only when CPU is the bottleneck as you can see in top
with one line per each CPU output, which is pretty much always true for high IOPS tests.)

Thus, I'm afraid, looks like currently fio, being a really great tool, has severe
limitations for high IOPS measurements, because it has too big internal load generation
and measurement overheads. It's like having a thermometer, which has mistake 0-infinity
depending from temperature you are measuring. If it's low enough, you will have 100%
accuracy, but if it's too high, it might start measuring something internal instead of
what it is supposed to measure. To be fair, all thermometers behave like this ;).
However, this analyze shows that fio accuracy significantly declining starting from few
hundreds K IOPS, where for me it has with libaio and my NVRAM card 22% overhead on 612K
IOPS (QD 8, single job). Adding more jobs increases IOPS up to the card's limit, but
the per thread overhead remains about the same.

Just a friendly analyze in a hope to improve the great tool. Multi-million IOPS storage
is coming, so this is important. Or did I miss anything?

> You may want to look at what fio settings your flash vendor recommends
> for benchmarking purposes...

Those where I started from. However, being a person with an experimental physics
background, I started from very basics: calibrating my tools to figure out instrumental
mistakes I have with them. My checks with fio led to this thread.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-02  4:25   ` Vladislav Bolkhovitin
@ 2016-03-02  7:38     ` Sitsofe Wheeler
  2016-03-03  3:02       ` Vladislav Bolkhovitin
  2016-03-02 18:37     ` Elliott, Robert (Persistent Memory)
  2016-03-03  3:03     ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 22+ messages in thread
From: Sitsofe Wheeler @ 2016-03-02  7:38 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org

On 2 March 2016 at 04:25, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> Hi,
>
> Sitsofe Wheeler wrote on 02/29/2016 10:01 PM:
>> Hi,
>>
>> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>> Hello,
>>>
>>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
>>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
>>> and direct=1, so user space CPU consumption should be virtually zero.
>>>
>>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
>>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
>>> hope, I'm using correct English terms).
>>>
>>> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>>>
>>> Here is my full fio job:
>>>
>>> [global]
>>> ioengine=libaio
>>> buffered=0
>>> sync=0
>>> direct=1
>>> randrepeat=1
>>> softrandommap=1
>>> rw=randread
>>> bs=4k
>>> filename=./nvram (it's a link to a block device)
>>> exitall=1
>>> thread=1
>>> disable_lat=1
>>> disable_slat=1
>>> disable_clat=1
>>> loops=10
>>> iodepth=16
>>
>> You appear to be missing gtod_reduce
>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
>> gettimeofday cpu pinning. You also aren't using batching
>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).
>
> Thanks, I tried them, but they did not make any significant difference. The biggest

There was no cpu reduction from setting both iodepth_batch and
iodepth_batch_complete together?

> difference I had was when I changed CPU governor to "performance". Now I have 20-25%
> user space, measured by fio itself, it's coherent with top. Note, I'm considering
> per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU).

Have you applied the points mentioned in
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Performance_Features_in_RednbspHat_EnterprisenbspLinuxnbsp7.html
? Things like the scheduler granularity (as changed by Red Hat's
tuned-adm) can have a large impact...

> The full job file was:
>
> [global]
> ioengine=sync
> buffered=0
> sync=0
> direct=1
[...]
> iodepth=8 /* does not matter */
^^^ It's worth noting that direct and iodepth don't really have an
impact on synchronous ioengines -
https://github.com/axboe/fio/blob/master/HOWTO#L804 .

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-02  7:38     ` Sitsofe Wheeler
@ 2016-03-03  3:02       ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-03  3:02 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Sitsofe Wheeler wrote on 03/01/2016 11:38 PM:
> On 2 March 2016 at 04:25, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>> Hi,
>>
>> Sitsofe Wheeler wrote on 02/29/2016 10:01 PM:
>>> Hi,
>>>
>>> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>>> Hello,
>>>>
>>>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
>>>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
>>>> and direct=1, so user space CPU consumption should be virtually zero.
>>>>
>>>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
>>>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
>>>> hope, I'm using correct English terms).
>>>>
>>>> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>>>>
>>>> Here is my full fio job:
>>>>
>>>> [global]
>>>> ioengine=libaio
>>>> buffered=0
>>>> sync=0
>>>> direct=1
>>>> randrepeat=1
>>>> softrandommap=1
>>>> rw=randread
>>>> bs=4k
>>>> filename=./nvram (it's a link to a block device)
>>>> exitall=1
>>>> thread=1
>>>> disable_lat=1
>>>> disable_slat=1
>>>> disable_clat=1
>>>> loops=10
>>>> iodepth=16
>>>
>>> You appear to be missing gtod_reduce
>>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
>>> gettimeofday cpu pinning. You also aren't using batching
>>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).
>>
>> Thanks, I tried them, but they did not make any significant difference. The biggest
> 
> There was no cpu reduction from setting both iodepth_batch and
> iodepth_batch_complete together?

No, no significant difference.

I also forgot to mention that I had tried gtod_cpu=11 as well, and it only made things
worse.

>> difference I had was when I changed CPU governor to "performance". Now I have 20-25%
>> user space, measured by fio itself, it's coherent with top. Note, I'm considering
>> per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU).
> 
> Have you applied the points mentioned in
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Performance_Features_in_RednbspHat_EnterprisenbspLinuxnbsp7.html
> ? Things like the scheduler granularity (as changed by Red Hat's
> tuned-adm) can have a large impact...

Thanks for your help and feedback, but you are trying to improve my results, while I'm
asking how to _decrease fio overhead_ on high IOPS with libaio. It's very different
question.

>> The full job file was:
>>
>> [global]
>> ioengine=sync
>> buffered=0
>> sync=0
>> direct=1
> [...]
>> iodepth=8 /* does not matter */
> ^^^ It's worth noting that direct and iodepth don't really have an
> impact on synchronous ioengines -
> https://github.com/axboe/fio/blob/master/HOWTO#L804

Yes, sure, this is why I commented it as "doesn't matter".

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: Fio high IOPS measurement mistake
  2016-03-02  4:25   ` Vladislav Bolkhovitin
  2016-03-02  7:38     ` Sitsofe Wheeler
@ 2016-03-02 18:37     ` Elliott, Robert (Persistent Memory)
  2016-03-03  3:03       ` Vladislav Bolkhovitin
  2016-03-03  3:03     ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 22+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2016-03-02 18:37 UTC (permalink / raw)
  To: Vladislav Bolkhovitin, Sitsofe Wheeler, fio@vger.kernel.org

> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Vladislav Bolkhovitin
> Sent: Tuesday, March 01, 2016 10:25 PM
> To: Sitsofe Wheeler <sitsofe@gmail.com>; fio@vger.kernel.org
> Subject: Re: Fio high IOPS measurement mistake
> 
> Hi,
> 
...
> >> Here is my full fio job:
> >>
> >> [global]
> >> ioengine=libaio
> >> buffered=0
> >> sync=0
> >> direct=1
> >> randrepeat=1
> >> softrandommap=1
> >> rw=randread
> >> bs=4k
> >> filename=./nvram (it's a link to a block device)
> >> exitall=1
> >> thread=1
> >> disable_lat=1
> >> disable_slat=1
> >> disable_clat=1
> >> loops=10
> >> iodepth=16
> >
> > You appear to be missing gtod_reduce
> > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
> > gettimeofday cpu pinning. You also aren't using batching
> > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).
> 
...
> Jobs	IOPS(M)	%user	%sys
> 1	4.3	78	22
> 2	7.6	67	33
> 3	10.5	65	35
> 4	7.7	61	38
> 5	4.8	78	22
> 6	4.7	83	17
> 7	4.8	84	15

Use cpus_allowed_policy=split to keep threads or processes
from wandering across CPUs.

If the system is NUMA, split the NVDIMMs based on that
and use cpus_allowed to just have threads access local
NVDIMMs.

As someone else mentioned, the random map overhead is
noticeable, especially for large capacity devices.  Use
norandommap and randrepeat=0.

There was a problem in 2014 requiring invalidate=0 for
small capacities; I don't recall if that was fixed.

If you're using the pmem driver, it is incapable of queuing
(submissions are immediately completed), so iodepth is
irrelevant.

There are other system and OS settings to tune, like: 
* use the performance governor. The userspace governor
spends a lot of time making acpi_os_write calls
* turn off auditing (audit=0 on the kernel command line)
* turn off C states

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-02 18:37     ` Elliott, Robert (Persistent Memory)
@ 2016-03-03  3:03       ` Vladislav Bolkhovitin
  2016-03-03 21:03         ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-03  3:03 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Sitsofe Wheeler,
	fio@vger.kernel.org

Elliott, Robert (Persistent Memory) wrote on 03/02/2016 10:37 AM:
>>>> Here is my full fio job:
>>>>
>>>> [global]
>>>> ioengine=libaio
>>>> buffered=0
>>>> sync=0
>>>> direct=1
>>>> randrepeat=1
>>>> softrandommap=1
>>>> rw=randread
>>>> bs=4k
>>>> filename=./nvram (it's a link to a block device)
>>>> exitall=1
>>>> thread=1
>>>> disable_lat=1
>>>> disable_slat=1
>>>> disable_clat=1
>>>> loops=10
>>>> iodepth=16
>>>
>>> You appear to be missing gtod_reduce
>>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
>>> gettimeofday cpu pinning. You also aren't using batching
>>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).
>>
> ...
>> Jobs	IOPS(M)	%user	%sys
>> 1	4.3	78	22
>> 2	7.6	67	33
>> 3	10.5	65	35
>> 4	7.7	61	38
>> 5	4.8	78	22
>> 6	4.7	83	17
>> 7	4.8	84	15
> 
> Use cpus_allowed_policy=split to keep threads or processes
> from wandering across CPUs.

Tried, no help, which is easily explainable, because I'm always running num jobs < num
CPUs.

> If the system is NUMA, split the NVDIMMs based on that
> and use cpus_allowed to just have threads access local
> NVDIMMs.

No, it's single socket.

> As someone else mentioned, the random map overhead is
> noticeable, especially for large capacity devices.  Use
> norandommap and randrepeat=0.

Had done, see the job file in my message you didn't quote.

> There was a problem in 2014 requiring invalidate=0 for
> small capacities; I don't recall if that was fixed.
> 
> If you're using the pmem driver, it is incapable of queuing
> (submissions are immediately completed), so iodepth is
> irrelevant.
> 
> There are other system and OS settings to tune, like: 
> * use the performance governor. The userspace governor
> spends a lot of time making acpi_os_write calls

Had done, see not quoted part of my e-mail.

> * turn off auditing (audit=0 on the kernel command line)
> * turn off C states

Tried too.

Overall, I appreciate your help, but again, question is not how to improve my results.
The question is how to _decrease fio overhead_ with libaio, see subject of this e-mail.
It's very different question.

Thanks,
Vlad


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: Fio high IOPS measurement mistake
  2016-03-03  3:03       ` Vladislav Bolkhovitin
@ 2016-03-03 21:03         ` Elliott, Robert (Persistent Memory)
  2016-03-04  4:36           ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 22+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2016-03-03 21:03 UTC (permalink / raw)
  To: Vladislav Bolkhovitin, Sitsofe Wheeler, fio@vger.kernel.org

> -----Original Message-----
> From: Vladislav Bolkhovitin [mailto:vst@vlnb.net]
> Sent: Wednesday, March 2, 2016 9:03 PM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Sitsofe Wheeler
> <sitsofe@gmail.com>; fio@vger.kernel.org
> Subject: Re: Fio high IOPS measurement mistake
> 
...
> 
> Overall, I appreciate your help, but again, question is not how to improve
> my results.
> The question is how to _decrease fio overhead_ with libaio, see subject of
> this e-mail.
> It's very different question.
> 
> Thanks,
> Vlad

Here are some example results on one of my test systems with 4.4rc2, 
showing %usr around 19%.

This job file:
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=4k
iodepth=1       # irrelevant for pmem
runtime=600
time_based=1
group_reporting
thread
gtod_reduce=1   # reduce=1 except for latency test
zero_buffers
cpus_allowed_policy=split
numjobs=16

[drive_0]
filename=/dev/pmem0
cpus_allowed=0-63
rw=randread

[drive_1]
filename=/dev/pmem1
cpus_allowed=0-63
rw=randread

[drive_2]
filename=/dev/pmem2
cpus_allowed=0-63
rw=randread

[drive_3]
filename=/dev/pmem3
cpus_allowed=0-63
rw=randread

yields about 16M IOPS:
  read : io=9013.8GB, bw=63505MB/s, iops=16257K, runt=145344msec
  cpu          : usr=19.04%, sys=80.86%, ctx=79415, majf=0, minf=4521
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=2362899826/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

with mpstat 1 reporting about 19% usr, 91% sys:
02:17:13 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:17:14 PM  all   19.11    0.00   80.89    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:17:15 PM  all   19.19    0.00   80.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:17:16 PM  all   19.27    0.00   80.73    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:17:17 PM  all   19.26    0.00   80.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00

With this test, the thread and zero_buffers options don't matter.

The system has 4 NUMA nodes; restricting cpus_allowed to local CPUs
for each pmem device raises that to 20M IOPS.  
  read : io=7998.5GB, bw=78461MB/s, iops=20086K, runt=104388msec
  cpu          : usr=19.55%, sys=56.98%, ctx=43481, majf=0, minf=3956
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=2096751180/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

perf top --dsos fio:
   3.00%  [.] get_io_u
   2.22%  [.] get_next_rand_offset
   2.15%  [.] thread_main
   2.11%  [.] io_u_queued_complete
   1.64%  [.] td_io_queue
   1.44%  [.] __get_io_u
   1.40%  [.] io_completed
   1.17%  [.] fio_libaio_commit
   0.93%  [.] fio_libaio_prep
   0.84%  [.] utime_since_now
   0.74%  [.] wait_for_completions
   0.67%  [.] fio_libaio_queue
   0.60%  [.] fio_libaio_getevents
   0.54%  [.] td_io_getevents

perf top -g:
+   67.45%     0.45%  [kernel]                 [k] entry_SYSCALL_64_fastpath
+   63.61%     0.68%  libaio.so.1.0.1          [.] io_submit
+   61.08%     0.10%  [kernel]                 [k] sys_io_submit
+   59.96%     1.55%  [kernel]                 [k] do_io_submit
+   52.82%     0.68%  [kernel]                 [k] aio_run_iocb
+   42.85%     0.36%  [kernel]                 [k] blkdev_read_iter
+   42.20%     0.88%  [kernel]                 [k] generic_file_read_iter
+   40.96%     0.49%  [kernel]                 [k] blkdev_direct_IO
+   40.20%     2.70%  [kernel]                 [k] dax_do_io
+   35.93%    35.93%  [kernel]                 [k] copy_user_enhanced_fast_string
+    6.09%     2.79%  [kernel]                 [k] aio_complete
+    5.55%     0.43%  [kernel]                 [k] sys_io_getevents
+    5.38%     0.00%  [unknown]                [.] 0x0684000241000684
+    4.09%     0.35%  [kernel]                 [k] read_events
+    3.01%     0.00%  [unknown]                [.] 0000000000000000
+    2.98%     0.62%  [kernel]                 [k] rw_verify_area
+    2.95%     2.93%  fio                      [.] get_io_u
+    2.67%     0.01%  perf                     [.] hist_entry_iter__add
+    2.42%     1.88%  [kernel]                 [k] aio_read_events
+    2.20%     0.36%  [kernel]                 [k] security_file_permission
+    2.13%     2.11%  fio                      [.] thread_main
+    2.09%     2.08%  fio                      [.] get_next_rand_offset
+    2.01%     1.99%  fio                      [.] io_u_queued_complete
+    1.96%     0.00%  libaio.so.1.0.1          [.] 0xffff80df612af644
+    1.66%     1.66%  [kernel]                 [k] lookup_ioctx
+    1.51%     0.23%  [kernel]                 [k] dax_map_atomic
+    1.49%     1.49%  [kernel]                 [k] entry_SYSCALL_64_after_swapgs
+    1.49%     1.48%  fio                      [.] td_io_queue
+    1.46%     1.46%  [kernel]                 [k] __fget
+    1.39%     1.38%  fio                      [.] io_completed
+    1.36%     1.35%  fio                      [.] __get_io_u
+    1.34%     1.34%  [kernel]                 [k] entry_SYSCALL_64
+    1.33%     0.08%  [kernel]                 [k] fget
+    1.14%     1.13%  fio                      [.] fio_libaio_commit
+    1.12%     0.99%  [kernel]                 [k] selinux_file_permission
+    1.03%     1.03%  [kernel]                 [k] kmem_cache_alloc
+    0.94%     0.54%  [kernel]                 [k] bdev_direct_access
+    0.91%     0.14%  [kernel]                 [k] kiocb_free
+    0.90%     0.89%  fio                      [.] fio_libaio_prep
+    0.88%     0.28%  [kernel]                 [k] refill_reqs_available
+    0.86%     0.85%  fio                      [.] utime_since_now
+    0.79%     0.79%  [kernel]                 [k] get_reqs_available
+    0.79%     0.79%  [kernel]                 [k] kmem_cache_free



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03 21:03         ` Elliott, Robert (Persistent Memory)
@ 2016-03-04  4:36           ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-04  4:36 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Sitsofe Wheeler,
	fio@vger.kernel.org

Elliott, Robert (Persistent Memory) wrote on 03/03/2016 01:03 PM:
>> -----Original Message-----
>> From: Vladislav Bolkhovitin [mailto:vst@vlnb.net]
>> Sent: Wednesday, March 2, 2016 9:03 PM
>> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Sitsofe Wheeler
>> <sitsofe@gmail.com>; fio@vger.kernel.org
>> Subject: Re: Fio high IOPS measurement mistake
>>
> ...
>>
>> Overall, I appreciate your help, but again, question is not how to improve
>> my results.
>> The question is how to _decrease fio overhead_ with libaio, see subject of
>> this e-mail.
>> It's very different question.
>>
>> Thanks,
>> Vlad
> 
> Here are some example results on one of my test systems with 4.4rc2, 
> showing %usr around 19%.
> 
> This job file:
> [global]
> direct=1
> ioengine=libaio
> norandommap
> randrepeat=0
> bs=4k
> iodepth=1       # irrelevant for pmem
> runtime=600
> time_based=1
> group_reporting
> thread
> gtod_reduce=1   # reduce=1 except for latency test
> zero_buffers
> cpus_allowed_policy=split
> numjobs=16
> 
> [drive_0]
> filename=/dev/pmem0
> cpus_allowed=0-63
> rw=randread
> 
> [drive_1]
> filename=/dev/pmem1
> cpus_allowed=0-63
> rw=randread
> 
> [drive_2]
> filename=/dev/pmem2
> cpus_allowed=0-63
> rw=randread
> 
> [drive_3]
> filename=/dev/pmem3
> cpus_allowed=0-63
> rw=randread
> 
> yields about 16M IOPS:
>   read : io=9013.8GB, bw=63505MB/s, iops=16257K, runt=145344msec
>   cpu          : usr=19.04%, sys=80.86%, ctx=79415, majf=0, minf=4521
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=2362899826/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> with mpstat 1 reporting about 19% usr, 91% sys:
> 02:17:13 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 02:17:14 PM  all   19.11    0.00   80.89    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 02:17:15 PM  all   19.19    0.00   80.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 02:17:16 PM  all   19.27    0.00   80.73    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 02:17:17 PM  all   19.26    0.00   80.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 
> With this test, the thread and zero_buffers options don't matter.
> 
> The system has 4 NUMA nodes; restricting cpus_allowed to local CPUs
> for each pmem device raises that to 20M IOPS.  
>   read : io=7998.5GB, bw=78461MB/s, iops=20086K, runt=104388msec
>   cpu          : usr=19.55%, sys=56.98%, ctx=43481, majf=0, minf=3956
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=2096751180/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> perf top --dsos fio:
>    3.00%  [.] get_io_u
>    2.22%  [.] get_next_rand_offset
>    2.15%  [.] thread_main
>    2.11%  [.] io_u_queued_complete
>    1.64%  [.] td_io_queue
>    1.44%  [.] __get_io_u
>    1.40%  [.] io_completed
>    1.17%  [.] fio_libaio_commit
>    0.93%  [.] fio_libaio_prep
>    0.84%  [.] utime_since_now
>    0.74%  [.] wait_for_completions
>    0.67%  [.] fio_libaio_queue
>    0.60%  [.] fio_libaio_getevents
>    0.54%  [.] td_io_getevents
> 
> perf top -g:
> +   67.45%     0.45%  [kernel]                 [k] entry_SYSCALL_64_fastpath
> +   63.61%     0.68%  libaio.so.1.0.1          [.] io_submit
> +   61.08%     0.10%  [kernel]                 [k] sys_io_submit
> +   59.96%     1.55%  [kernel]                 [k] do_io_submit
> +   52.82%     0.68%  [kernel]                 [k] aio_run_iocb
> +   42.85%     0.36%  [kernel]                 [k] blkdev_read_iter
> +   42.20%     0.88%  [kernel]                 [k] generic_file_read_iter
> +   40.96%     0.49%  [kernel]                 [k] blkdev_direct_IO
> +   40.20%     2.70%  [kernel]                 [k] dax_do_io
> +   35.93%    35.93%  [kernel]                 [k] copy_user_enhanced_fast_string
> +    6.09%     2.79%  [kernel]                 [k] aio_complete
> +    5.55%     0.43%  [kernel]                 [k] sys_io_getevents
> +    5.38%     0.00%  [unknown]                [.] 0x0684000241000684
> +    4.09%     0.35%  [kernel]                 [k] read_events
> +    3.01%     0.00%  [unknown]                [.] 0000000000000000
> +    2.98%     0.62%  [kernel]                 [k] rw_verify_area
> +    2.95%     2.93%  fio                      [.] get_io_u
> +    2.67%     0.01%  perf                     [.] hist_entry_iter__add
> +    2.42%     1.88%  [kernel]                 [k] aio_read_events
> +    2.20%     0.36%  [kernel]                 [k] security_file_permission
> +    2.13%     2.11%  fio                      [.] thread_main
> +    2.09%     2.08%  fio                      [.] get_next_rand_offset
> +    2.01%     1.99%  fio                      [.] io_u_queued_complete
> +    1.96%     0.00%  libaio.so.1.0.1          [.] 0xffff80df612af644
> +    1.66%     1.66%  [kernel]                 [k] lookup_ioctx
> +    1.51%     0.23%  [kernel]                 [k] dax_map_atomic
> +    1.49%     1.49%  [kernel]                 [k] entry_SYSCALL_64_after_swapgs
> +    1.49%     1.48%  fio                      [.] td_io_queue
> +    1.46%     1.46%  [kernel]                 [k] __fget
> +    1.39%     1.38%  fio                      [.] io_completed
> +    1.36%     1.35%  fio                      [.] __get_io_u
> +    1.34%     1.34%  [kernel]                 [k] entry_SYSCALL_64
> +    1.33%     0.08%  [kernel]                 [k] fget
> +    1.14%     1.13%  fio                      [.] fio_libaio_commit
> +    1.12%     0.99%  [kernel]                 [k] selinux_file_permission
> +    1.03%     1.03%  [kernel]                 [k] kmem_cache_alloc
> +    0.94%     0.54%  [kernel]                 [k] bdev_direct_access
> +    0.91%     0.14%  [kernel]                 [k] kiocb_free
> +    0.90%     0.89%  fio                      [.] fio_libaio_prep
> +    0.88%     0.28%  [kernel]                 [k] refill_reqs_available
> +    0.86%     0.85%  fio                      [.] utime_since_now
> +    0.79%     0.79%  [kernel]                 [k] get_reqs_available
> +    0.79%     0.79%  [kernel]                 [k] kmem_cache_free

Thank you, you are proving my point and my concerns. Your per job IOPS (~1M) and user
space consumption (20%) are similar to mine (640K and 25% correspondingly) and far from
max IOPS possible (16M), so fio (or libaio?) overhead is seen in full in your test.
Difference between your and my results might be explained that you are using the latest
development, while I'm using SLES 12SP1, which is, as you can imagine, far behind of
the latest development.

Moreover, what is your PMEM? If it is a regular DDR4, depending from how many DIMMs you
have (I guess, 4+ per NUMA node to populate all memory channels?) it should be capable
of more or much more, than 16M IOPS overall and 1M per-thread (much more for sure), so
it smells to me that the fio instrumental mistake plays significant role in your
measurements, making your results significantly lower, than real HW and IO stack are
really capable, hence you are pushing your fio tool in the range, where its accuracy
drops significantly.

I'd bet, if you take an ideal benchmarking tool with zero overhead, your results would
be significantly higher. Actually, we are seeing this with our SCST (Linux SCSI target)
tests, when with multiple initiators we sometimes have better performance, than with
FIO locally. Until now I have never had time to look at it more closely. Looks like I
have explanation now.

Thanks,
Vlad


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-02  4:25   ` Vladislav Bolkhovitin
  2016-03-02  7:38     ` Sitsofe Wheeler
  2016-03-02 18:37     ` Elliott, Robert (Persistent Memory)
@ 2016-03-03  3:03     ` Vladislav Bolkhovitin
  2016-03-03  7:10       ` Sitsofe Wheeler
  2 siblings, 1 reply; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-03  3:03 UTC (permalink / raw)
  To: Sitsofe Wheeler, fio@vger.kernel.org

Vladislav Bolkhovitin wrote on 03/01/2016 08:25 PM:
> Hi,
> 
> Sitsofe Wheeler wrote on 02/29/2016 10:01 PM:
>> Hi,
>>
>> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>> Hello,
>>>
>>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
>>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
>>> and direct=1, so user space CPU consumption should be virtually zero.
>>>
>>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
>>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
>>> hope, I'm using correct English terms).
>>>
>>> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>>>
>>> Here is my full fio job:
>>>
>>> [global]
>>> ioengine=libaio
>>> buffered=0
>>> sync=0
>>> direct=1
>>> randrepeat=1
>>> softrandommap=1
>>> rw=randread
>>> bs=4k
>>> filename=./nvram (it's a link to a block device)
>>> exitall=1
>>> thread=1
>>> disable_lat=1
>>> disable_slat=1
>>> disable_clat=1
>>> loops=10
>>> iodepth=16
>>
>> You appear to be missing gtod_reduce
>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or
>> gettimeofday cpu pinning. You also aren't using batching
>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ).
> 
> Thanks, I tried them, but they did not make any significant difference. The biggest
> difference I had was when I changed CPU governor to "performance". Now I have 20-25%
> user space, measured by fio itself, it's coherent with top. Note, I'm considering
> per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU).
> 
> I also tried to short circuit the sync engine by calling fio_io_end() directly from top
> of fio_syncio_queue(), so no actual IO is done. The results were interesting enough to
> publish here in details (%% are per job):
> 
> Jobs	IOPS(M)	%user	%sys
> 1	4.3	78	22
> 2	7.6	67	33
> 3	10.5	65	35
> 4	7.7	61	38
> 5	4.8	78	22
> 6	4.7	83	17
> 7	4.8	84	15

Update, since I realized that I forgot to short circuit lseek calls as well. After I
done it in fio_syncio_prep(), the results significantly changed and became much more as
expected:

Jobs	IOPS(M)	%user	%sys
1	6.9	100	0
2	13.6	100	0
3	19.5	100	0
4	25.1	100	0
5	32.5	100	0
6	38.1	100	0
7	45.5	100	0
8	50.8	100	0

So, apparently, the lseek calls were responsible for (1) such significant overhead
(4.3M vs 6.9M single), (2) such poor scalability (10.5M vs 50.8) and (3) the
significant 15-38% sys load in the previous tests.

Such lseeks influence should not be so big, because each job supposed to have
independent, dedicated file handle, correct? It together with 6.9M IOPS per job in the
new result seem to me that very similar problem with libaio might be responsible for
the overhead that made me start this thread (25% user space on 600K IOPS). Don't know
how to check it though, because libaio engine is not so simple to short circuit. Perf
is not helpful, because it produces essentially the same trace as below with no libaio
mentioned anyhow. Oprofile also does not anyhow point out to libaio.

For those who asked about perf profiling, it remained the same as before with the CPU
consumption is all about timekeeping and memset:

-  55.74%  fio  fio                [.] clock_thread_fn
     clock_thread_fn
-   7.11%  fio  [vdso]             [.] 0x00000000000007c2
     0x7ffc411657c2
     __clock_gettime
     0x7ffc411657c6
     __clock_gettime
     0x7ffc4116570d
     __clock_gettime
     0x7ffc411657cb
     __clock_gettime
     0x7ffc41165720
     __clock_gettime
     0x7ffc411657c0
     __clock_gettime
     0x7ffc411657cf
     __clock_gettime
     0x7ffc411657c7
     __clock_gettime
     0x7ffc41165700
-   5.37%  fio  [kernel.kallsyms]  [k] system_call
     system_call
     0x7ffc411657c2
-   3.70%  fio  [kernel.kallsyms]  [k] system_call_after_swapgs
   + system_call_after_swapgs
   + 0x7ffc411657c2
-   3.18%  fio  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     copy_user_enhanced_fast_string
     sys_clock_gettime
     system_call_fastpath
     0x7ffc411657c2
     0x7ffc411657c2
-   2.55%  fio  libc-2.19.so       [.] memset
     memset
     0x7320676e69737261
-   2.43%  fio  fio                [.] calibrate_cpu_clock
     calibrate_cpu_clock
+   2.31%  fio  [kernel.kallsyms]  [k] getrawmonotonic
+   1.85%  fio  [kernel.kallsyms]  [k] page_fault
+   1.62%  fio  [kernel.kallsyms]  [k] native_read_tsc
+   1.56%  fio  [kernel.kallsyms]  [k] sys_clock_gettime
+   1.53%  fio  libc-2.19.so       [.] msort_with_tmp.part.0
+   0.96%  fio  [kernel.kallsyms]  [k] clear_page_c_e
+   0.92%  fio  [kernel.kallsyms]  [k] sysret_check
+   0.87%  fio  [kernel.kallsyms]  [k] shmem_getpage_gfp
+   0.58%  fio  [kernel.kallsyms]  [k] read_tsc
+   0.46%  fio  libc-2.19.so       [.] __clock_gettime
+   0.45%  fio  libc-2.19.so       [.] __memcpy_sse2

With sequential reads, i.e. with no random numbers generation and tracking, the results
were:

Jobs	IOPS(M)	%user	%sys
1	7.3	100	0
2	14.7	100	0
3	20.8	100	0
4	27.8	100	0
5	34.7	100	0
6	41.6	100	0
7	48.6	100	0
8	55.4	100	0

So, overhead of the random numbers generation and tracking with used jobs parameters is
about 6-9% or 0.4M IOPS per job, which is quite OK, I'd say.

Vlad

> Results were very consistent between runs. CPU - 8 cores Intel Xeon E5-2667 v3 @
> 3.20GHz with 20M L3 cache and HT. Fio is the latest git.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03  3:03     ` Vladislav Bolkhovitin
@ 2016-03-03  7:10       ` Sitsofe Wheeler
  2016-03-03  7:13         ` Sitsofe Wheeler
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Sitsofe Wheeler @ 2016-03-03  7:10 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org

On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> For those who asked about perf profiling, it remained the same as before with the CPU
> consumption is all about timekeeping and memset:
>
> -  55.74%  fio  fio                [.] clock_thread_fn
>      clock_thread_fn

Perhaps this is what is already included above but could you use the
-g option on perf to collect it into a call-graph and post the top
results?

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03  7:10       ` Sitsofe Wheeler
@ 2016-03-03  7:13         ` Sitsofe Wheeler
  2016-03-04  4:37           ` Vladislav Bolkhovitin
  2016-03-03 16:20         ` Jens Axboe
  2016-03-04  4:37         ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 22+ messages in thread
From: Sitsofe Wheeler @ 2016-03-03  7:13 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org

On 3 March 2016 at 07:10, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>> For those who asked about perf profiling, it remained the same as before with the CPU
>> consumption is all about timekeeping and memset:
>>
>> -  55.74%  fio  fio                [.] clock_thread_fn
>>      clock_thread_fn
>
> Perhaps this is what is already included above but could you use the
> -g option on perf to collect it into a call-graph and post the top
> results?

One extra question: do you see a difference between when you use
threads (thread=1) and when you use processes (thread=0)?

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03  7:13         ` Sitsofe Wheeler
@ 2016-03-04  4:37           ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-04  4:37 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Sitsofe Wheeler wrote on 03/02/2016 11:13 PM:
> On 3 March 2016 at 07:10, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>> For those who asked about perf profiling, it remained the same as before with the CPU
>>> consumption is all about timekeeping and memset:
>>>
>>> -  55.74%  fio  fio                [.] clock_thread_fn
>>>      clock_thread_fn
>>
>> Perhaps this is what is already included above but could you use the
>> -g option on perf to collect it into a call-graph and post the top
>> results?
> 
> One extra question: do you see a difference between when you use
> threads (thread=1) and when you use processes (thread=0)?

Nothing serious (few %%), which could be easily explained, because I'm running number
of jobs less, than CPU cores count, so each thread or process has a dedicated CPU core.

Thanks,
Vlad




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03  7:10       ` Sitsofe Wheeler
  2016-03-03  7:13         ` Sitsofe Wheeler
@ 2016-03-03 16:20         ` Jens Axboe
  2016-03-04  4:37           ` Vladislav Bolkhovitin
  2016-03-04  4:37         ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2016-03-03 16:20 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Vladislav Bolkhovitin, fio@vger.kernel.org

On Thu, Mar 03 2016, Sitsofe Wheeler wrote:
> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> > For those who asked about perf profiling, it remained the same as before with the CPU
> > consumption is all about timekeeping and memset:
> >
> > -  55.74%  fio  fio                [.] clock_thread_fn
> >      clock_thread_fn
> 
> Perhaps this is what is already included above but could you use the
> -g option on perf to collect it into a call-graph and post the top
> results?

The above looks like a side effect of using gtod_cpu, it'll burn one
core. For the original poster - did you verify whether using gtod_cpu
was faster than using the CPU clock source in each CPU?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03 16:20         ` Jens Axboe
@ 2016-03-04  4:37           ` Vladislav Bolkhovitin
  2016-03-04 15:33             ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-04  4:37 UTC (permalink / raw)
  To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Jens Axboe wrote on 03/03/2016 08:20 AM:
> On Thu, Mar 03 2016, Sitsofe Wheeler wrote:
>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>> For those who asked about perf profiling, it remained the same as before with the CPU
>>> consumption is all about timekeeping and memset:
>>>
>>> -  55.74%  fio  fio                [.] clock_thread_fn
>>>      clock_thread_fn
>>
>> Perhaps this is what is already included above but could you use the
>> -g option on perf to collect it into a call-graph and post the top
>> results?
> 
> The above looks like a side effect of using gtod_cpu, it'll burn one
> core. For the original poster - did you verify whether using gtod_cpu
> was faster than using the CPU clock source in each CPU?

Yes, I had verified it and mentioned in one of my reports. It slightly decreased the
IOPS. I guess, it's a locking contention somewhere.

Thanks,
Vlad



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-04  4:37           ` Vladislav Bolkhovitin
@ 2016-03-04 15:33             ` Jens Axboe
  2016-03-05  0:47               ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2016-03-04 15:33 UTC (permalink / raw)
  To: Vladislav Bolkhovitin, Sitsofe Wheeler; +Cc: fio@vger.kernel.org

On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote:
> Jens Axboe wrote on 03/03/2016 08:20 AM:
>> On Thu, Mar 03 2016, Sitsofe Wheeler wrote:
>>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>>> For those who asked about perf profiling, it remained the same as before with the CPU
>>>> consumption is all about timekeeping and memset:
>>>>
>>>> -  55.74%  fio  fio                [.] clock_thread_fn
>>>>       clock_thread_fn
>>>
>>> Perhaps this is what is already included above but could you use the
>>> -g option on perf to collect it into a call-graph and post the top
>>> results?
>>
>> The above looks like a side effect of using gtod_cpu, it'll burn one
>> core. For the original poster - did you verify whether using gtod_cpu
>> was faster than using the CPU clock source in each CPU?
>
> Yes, I had verified it and mentioned in one of my reports. It slightly decreased the
> IOPS. I guess, it's a locking contention somewhere.

For clocksource=cpu there is no internal fio contention, nor can there 
be any kernel/OS contention. Getting the clock is serializing, so that 
might slow things down a bit.

I've seen you bring up this contention idea before. Is that pure 
guesswork on your end, or have you profiled any contention? For most fio 
workloads there is no cross-job traffic, that's by design. Fio does add 
some overhead in general (everything does), and some of it is opt-out 
like cutting down on the number of time calls (gtod_cpu). But that's 
different from any locking contention between jobs, since it's constant 
and isn't affected by how you spread the workloads.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-04 15:33             ` Jens Axboe
@ 2016-03-05  0:47               ` Vladislav Bolkhovitin
  2016-03-05  0:54                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-05  0:47 UTC (permalink / raw)
  To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Jens Axboe wrote on 03/04/2016 07:33 AM:
> On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote:
>> Jens Axboe wrote on 03/03/2016 08:20 AM:
>>> On Thu, Mar 03 2016, Sitsofe Wheeler wrote:
>>>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>>>> For those who asked about perf profiling, it remained the same as before with the CPU
>>>>> consumption is all about timekeeping and memset:
>>>>>
>>>>> -  55.74%  fio  fio                [.] clock_thread_fn
>>>>>       clock_thread_fn
>>>>
>>>> Perhaps this is what is already included above but could you use the
>>>> -g option on perf to collect it into a call-graph and post the top
>>>> results?
>>>
>>> The above looks like a side effect of using gtod_cpu, it'll burn one
>>> core. For the original poster - did you verify whether using gtod_cpu
>>> was faster than using the CPU clock source in each CPU?
>>
>> Yes, I had verified it and mentioned in one of my reports. It slightly decreased the
>> IOPS. I guess, it's a locking contention somewhere.
> 
> For clocksource=cpu there is no internal fio contention, nor can there be any kernel/OS
> contention. Getting the clock is serializing, so that might slow things down a bit.

Yes. Also, there might be a cache contention here, with one thread writing to a memory
location and multiple threads reading from it. The same type of contention why queue
spinlocks are faster, than ticket spinlocks.

> I've seen you bring up this contention idea before. 

Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if
you see performance dropping from certain number of threads, it is safe to guess there
is a lock contention somewhere.

> Is that pure guesswork on your end, or have you profiled any contention?

Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much
a black box for me. Generally, if you see performance drop with another thread, it must
be either locks contention, or communication overhead. Nowadays the former is more
common, hence the guess.

Thanks,
Vlad

> For most fio workloads there is no cross-job
> traffic, that's by design. Fio does add some overhead in general (everything does), and
> some of it is opt-out like cutting down on the number of time calls (gtod_cpu). But
> that's different from any locking contention between jobs, since it's constant and
> isn't affected by how you spread the workloads.
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-05  0:47               ` Vladislav Bolkhovitin
@ 2016-03-05  0:54                 ` Jens Axboe
  2016-03-05  1:09                   ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2016-03-05  0:54 UTC (permalink / raw)
  To: Vladislav Bolkhovitin, Sitsofe Wheeler; +Cc: fio@vger.kernel.org

On 03/04/2016 05:47 PM, Vladislav Bolkhovitin wrote:
> Jens Axboe wrote on 03/04/2016 07:33 AM:
>> On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote:
>>> Jens Axboe wrote on 03/03/2016 08:20 AM:
>>>> On Thu, Mar 03 2016, Sitsofe Wheeler wrote:
>>>>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>>>>>> For those who asked about perf profiling, it remained the same as before with the CPU
>>>>>> consumption is all about timekeeping and memset:
>>>>>>
>>>>>> -  55.74%  fio  fio                [.] clock_thread_fn
>>>>>>        clock_thread_fn
>>>>>
>>>>> Perhaps this is what is already included above but could you use the
>>>>> -g option on perf to collect it into a call-graph and post the top
>>>>> results?
>>>>
>>>> The above looks like a side effect of using gtod_cpu, it'll burn one
>>>> core. For the original poster - did you verify whether using gtod_cpu
>>>> was faster than using the CPU clock source in each CPU?
>>>
>>> Yes, I had verified it and mentioned in one of my reports. It slightly decreased the
>>> IOPS. I guess, it's a locking contention somewhere.
>>
>> For clocksource=cpu there is no internal fio contention, nor can there be any kernel/OS
>> contention. Getting the clock is serializing, so that might slow things down a bit.
>
> Yes. Also, there might be a cache contention here, with one thread writing to a memory
> location and multiple threads reading from it. The same type of contention why queue
> spinlocks are faster, than ticket spinlocks.

But cacheline contention for the clock part would be bigger with 
gtod_cpu, since you have one CPU continually dirtying that cachline and 
the CPUs with jobs running on them reading it. With clocksource=cpu, 
that part would not share data.

>> I've seen you bring up this contention idea before.
>
> Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if
> you see performance dropping from certain number of threads, it is safe to guess there
> is a lock contention somewhere.

That is true, the lseek() part is not in fio though, that's a kernel 
issue. But for this case, avoiding lseek through one of the sync 
variants with offset is the best solution. Might actually makes sense to 
make that the default.

>> Is that pure guesswork on your end, or have you profiled any contention?
>
> Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much
> a black box for me. Generally, if you see performance drop with another thread, it must
> be either locks contention, or communication overhead. Nowadays the former is more
> common, hence the guess.

Right, but that contention can be anywhere from the application to the 
driver. So some more details would be great, when you have them.

One thing I've seen for higher IOPS cases is that the bdev inode inc/dec 
for every IO, that hurts scalability. That was fixed here:

http://git.kernel.dk/cgit/linux-block/commit/?id=fe0f07d08ee35

but you might already be running a kernel with that included (what are 
you running?). Another one is the io stats collected, that tends to 
cause poorer scaling than you would expect, you can turn that off 
through sysfs (/sys/block/<dev>/queue/iostats).

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-05  0:54                 ` Jens Axboe
@ 2016-03-05  1:09                   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-05  1:09 UTC (permalink / raw)
  To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Jens Axboe wrote on 03/04/2016 04:54 PM:
>>> I've seen you bring up this contention idea before.
>>
>> Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if
>> you see performance dropping from certain number of threads, it is safe to guess there
>> is a lock contention somewhere.
> 
> That is true, the lseek() part is not in fio though, that's a kernel issue. But for
> this case, avoiding lseek through one of the sync variants with offset is the best
> solution. Might actually makes sense to make that the default.

BTW, as far as I can see, fio is using independent, fully dedicated file handles for
each job. Shouldn't independent file handles, ever for the same block device, behave
fully independently with nothing shared, hence protected by locks?

In my naive understanding, lseek should be just changing internal offset, a single
variable. Perhaps, translation fd (integer) -> corresponding internal structure is what
to blame here (also pure guess).

>>> Is that pure guesswork on your end, or have you profiled any contention?
>>
>> Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much
>> a black box for me. Generally, if you see performance drop with another thread, it must
>> be either locks contention, or communication overhead. Nowadays the former is more
>> common, hence the guess.
> 
> Right, but that contention can be anywhere from the application to the driver. So some
> more details would be great, when you have them.
> 
> One thing I've seen for higher IOPS cases is that the bdev inode inc/dec for every IO,
> that hurts scalability. That was fixed here:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?id=fe0f07d08ee35
> 
> but you might already be running a kernel with that included (what are you running?).
> Another one is the io stats collected, that tends to cause poorer scaling than you
> would expect, you can turn that off through sysfs (/sys/block/<dev>/queue/iostats).

Thanks, I'm planning to update to the latest kernel and try with it.

Vlad



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-03  7:10       ` Sitsofe Wheeler
  2016-03-03  7:13         ` Sitsofe Wheeler
  2016-03-03 16:20         ` Jens Axboe
@ 2016-03-04  4:37         ` Vladislav Bolkhovitin
  2 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-04  4:37 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org

Sitsofe Wheeler wrote on 03/02/2016 11:10 PM:
> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>> For those who asked about perf profiling, it remained the same as before with the CPU
>> consumption is all about timekeeping and memset:
>>
>> -  55.74%  fio  fio                [.] clock_thread_fn
>>      clock_thread_fn
> 
> Perhaps this is what is already included above but could you use the
> -g option on perf to collect it into a call-graph and post the top
> results?

My trace was for -g perf option. All entries marked by '-' were expanded.

Thanks,
Vlad



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-01  5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin
  2016-03-01  6:01 ` Sitsofe Wheeler
@ 2016-03-02  8:26 ` Andrey Kuzmin
  2016-03-03  3:02   ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 22+ messages in thread
From: Andrey Kuzmin @ 2016-03-02  8:26 UTC (permalink / raw)
  To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org

On Tue, Mar 1, 2016 at 8:17 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> Hello,
>
> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
> and direct=1, so user space CPU consumption should be virtually zero.
>
> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
> hope, I'm using correct English terms).
>

If you believe fio to be 'mistaken', please profile your runs with
perf and publish the profile,
pointing out what you believe to be a mistake.

> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>
> Here is my full fio job:
>
> [global]
> ioengine=libaio
> buffered=0
> sync=0
> direct=1
> randrepeat=1
> softrandommap=1

I suggest you start with switching the random map off as it's known to
be expensive and,
for some reason, is being maintained even in read-only workloads.

Regards,
Andrey

> rw=randread
> bs=4k
> filename=./nvram (it's a link to a block device)
> exitall=1
> thread=1
> disable_lat=1
> disable_slat=1
> disable_clat=1
> loops=10
> iodepth=16
>
> [file1]
>
> [file2]
>
> I'm working on million+ IOPS range.
>
> Thanks,
> Vlad
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fio high IOPS measurement mistake
  2016-03-02  8:26 ` Andrey Kuzmin
@ 2016-03-03  3:02   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Bolkhovitin @ 2016-03-03  3:02 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio@vger.kernel.org

Andrey Kuzmin wrote on 03/02/2016 12:26 AM:
> On Tue, Mar 1, 2016 at 8:17 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote:
>> Hello,
>>
>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio
>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0
>> and direct=1, so user space CPU consumption should be virtually zero.
>>
>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio
>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I
>> hope, I'm using correct English terms).
> 
> If you believe fio to be 'mistaken', please profile your runs with
> perf and publish the profile,
> pointing out what you believe to be a mistake.

I had done it, see my yesterday's e-mail.

>> Can anybody comment it and suggest how to decrease this user space CPU consumption?
>>
>> Here is my full fio job:
>>
>> [global]
>> ioengine=libaio
>> buffered=0
>> sync=0
>> direct=1
>> randrepeat=1
>> softrandommap=1
> 
> I suggest you start with switching the random map off as it's known to
> be expensive and,
> for some reason, is being maintained even in read-only workloads.

Also done, see the same my e-mail.

Thanks,
Vlad

> Regards,
> Andrey
> 
>> rw=randread
>> bs=4k
>> filename=./nvram (it's a link to a block device)
>> exitall=1
>> thread=1
>> disable_lat=1
>> disable_slat=1
>> disable_clat=1
>> loops=10
>> iodepth=16
>>
>> [file1]
>>
>> [file2]
>>
>> I'm working on million+ IOPS range.
>>
>> Thanks,
>> Vlad


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-03-05  1:09 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-01  5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin
2016-03-01  6:01 ` Sitsofe Wheeler
2016-03-02  4:25   ` Vladislav Bolkhovitin
2016-03-02  7:38     ` Sitsofe Wheeler
2016-03-03  3:02       ` Vladislav Bolkhovitin
2016-03-02 18:37     ` Elliott, Robert (Persistent Memory)
2016-03-03  3:03       ` Vladislav Bolkhovitin
2016-03-03 21:03         ` Elliott, Robert (Persistent Memory)
2016-03-04  4:36           ` Vladislav Bolkhovitin
2016-03-03  3:03     ` Vladislav Bolkhovitin
2016-03-03  7:10       ` Sitsofe Wheeler
2016-03-03  7:13         ` Sitsofe Wheeler
2016-03-04  4:37           ` Vladislav Bolkhovitin
2016-03-03 16:20         ` Jens Axboe
2016-03-04  4:37           ` Vladislav Bolkhovitin
2016-03-04 15:33             ` Jens Axboe
2016-03-05  0:47               ` Vladislav Bolkhovitin
2016-03-05  0:54                 ` Jens Axboe
2016-03-05  1:09                   ` Vladislav Bolkhovitin
2016-03-04  4:37         ` Vladislav Bolkhovitin
2016-03-02  8:26 ` Andrey Kuzmin
2016-03-03  3:02   ` Vladislav Bolkhovitin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox