* Fio high IOPS measurement mistake @ 2016-03-01 5:17 Vladislav Bolkhovitin 2016-03-01 6:01 ` Sitsofe Wheeler 2016-03-02 8:26 ` Andrey Kuzmin 0 siblings, 2 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-01 5:17 UTC (permalink / raw) To: fio Hello, I'm currently looking at one NVRAM device, and during fio tests noticed that each fio thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 and direct=1, so user space CPU consumption should be virtually zero. That 30% user CPU consumption makes me suspect that this is overhead for internal fio housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I hope, I'm using correct English terms). Can anybody comment it and suggest how to decrease this user space CPU consumption? Here is my full fio job: [global] ioengine=libaio buffered=0 sync=0 direct=1 randrepeat=1 softrandommap=1 rw=randread bs=4k filename=./nvram (it's a link to a block device) exitall=1 thread=1 disable_lat=1 disable_slat=1 disable_clat=1 loops=10 iodepth=16 [file1] [file2] I'm working on million+ IOPS range. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-01 5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin @ 2016-03-01 6:01 ` Sitsofe Wheeler 2016-03-02 4:25 ` Vladislav Bolkhovitin 2016-03-02 8:26 ` Andrey Kuzmin 1 sibling, 1 reply; 22+ messages in thread From: Sitsofe Wheeler @ 2016-03-01 6:01 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org Hi, On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > Hello, > > I'm currently looking at one NVRAM device, and during fio tests noticed that each fio > thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 > and direct=1, so user space CPU consumption should be virtually zero. > > That 30% user CPU consumption makes me suspect that this is overhead for internal fio > housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I > hope, I'm using correct English terms). > > Can anybody comment it and suggest how to decrease this user space CPU consumption? > > Here is my full fio job: > > [global] > ioengine=libaio > buffered=0 > sync=0 > direct=1 > randrepeat=1 > softrandommap=1 > rw=randread > bs=4k > filename=./nvram (it's a link to a block device) > exitall=1 > thread=1 > disable_lat=1 > disable_slat=1 > disable_clat=1 > loops=10 > iodepth=16 You appear to be missing gtod_reduce (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or gettimeofday cpu pinning. You also aren't using batching (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). You may want to look at what fio settings your flash vendor recommends for benchmarking purposes... -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-01 6:01 ` Sitsofe Wheeler @ 2016-03-02 4:25 ` Vladislav Bolkhovitin 2016-03-02 7:38 ` Sitsofe Wheeler ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-02 4:25 UTC (permalink / raw) To: Sitsofe Wheeler, fio@vger.kernel.org Hi, Sitsofe Wheeler wrote on 02/29/2016 10:01 PM: > Hi, > > On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >> Hello, >> >> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio >> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 >> and direct=1, so user space CPU consumption should be virtually zero. >> >> That 30% user CPU consumption makes me suspect that this is overhead for internal fio >> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I >> hope, I'm using correct English terms). >> >> Can anybody comment it and suggest how to decrease this user space CPU consumption? >> >> Here is my full fio job: >> >> [global] >> ioengine=libaio >> buffered=0 >> sync=0 >> direct=1 >> randrepeat=1 >> softrandommap=1 >> rw=randread >> bs=4k >> filename=./nvram (it's a link to a block device) >> exitall=1 >> thread=1 >> disable_lat=1 >> disable_slat=1 >> disable_clat=1 >> loops=10 >> iodepth=16 > > You appear to be missing gtod_reduce > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or > gettimeofday cpu pinning. You also aren't using batching > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). Thanks, I tried them, but they did not make any significant difference. The biggest difference I had was when I changed CPU governor to "performance". Now I have 20-25% user space, measured by fio itself, it's coherent with top. Note, I'm considering per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU). I also tried to short circuit the sync engine by calling fio_io_end() directly from top of fio_syncio_queue(), so no actual IO is done. The results were interesting enough to publish here in details (%% are per job): Jobs IOPS(M) %user %sys 1 4.3 78 22 2 7.6 67 33 3 10.5 65 35 4 7.7 61 38 5 4.8 78 22 6 4.7 83 17 7 4.8 84 15 Results were very consistent between runs. CPU - 8 cores Intel Xeon E5-2667 v3 @ 3.20GHz with 20M L3 cache and HT. Fio is the latest git. Obviously, if fio had zero overhead, i.e. instrumental mistake, IOPS level in this test should sky rocket to hundreds of millions IOPS to have few %% overhead on multi-million IOPS measurements. But we only have 4.3M per thread and 10.5M overall, which are, apparently, max fio is capable to measure in the current implementation doesn't matter how fast the storage stack is (it simply doesn't have more CPU cycles to run). Also, apparently, there is some lock contention for something inside fio, which is severely limiting multi-jobs performance. Interesting that gtod_cpu for the single job case decreased IOPS to 3.8M with the same user/sys %%: 22/78. Explicit clocksource=cpu didn't make any difference. Another observation is why the sys CPU consumption is so high, if TSC clock is used? Apparently, it was not used ever despite of explicit clocksource=cpu. I checked perf for the single job case, which is the most interesting case where to start optimizing from, and it reported 69% of time was spent in clock_thread_fn() and 3.2% in memset(). The latter also rises question, why is memset for the READs test? Apparently, this memset is on high IO path. The full job file was: [global] ioengine=sync buffered=0 sync=0 direct=1 randrepeat=0 norandommap softrandommap=1 random_generator=lfsr /* does not really matter */ rw=randread bs=4K /* it does not matter, since it's short circuit */ filename=./nvram /* does not matter */ exitall=1 thread=1 gtod_reduce=1 loops=10 iodepth=8 /* does not matter */ [file1] [file2] ... The consumed user space CPU roughly could be considered the instrumental mistake. Generally speaking, we have 3 components: load generator, measurement infrastructure ("a gauge") and load processor (storage). The storage is the object whose performance we are measuring by applying load from the load generator and using the measurement infrastructure to get the results. Since the storage stack is entirely in the kernel, what we can see as the user space CPU consumption is the aggregated load generator and measurement infrastructure CPU consumption, i.e. fio overhead, i.e. instrumental mistake. (Obviously, this is true only when CPU is the bottleneck as you can see in top with one line per each CPU output, which is pretty much always true for high IOPS tests.) Thus, I'm afraid, looks like currently fio, being a really great tool, has severe limitations for high IOPS measurements, because it has too big internal load generation and measurement overheads. It's like having a thermometer, which has mistake 0-infinity depending from temperature you are measuring. If it's low enough, you will have 100% accuracy, but if it's too high, it might start measuring something internal instead of what it is supposed to measure. To be fair, all thermometers behave like this ;). However, this analyze shows that fio accuracy significantly declining starting from few hundreds K IOPS, where for me it has with libaio and my NVRAM card 22% overhead on 612K IOPS (QD 8, single job). Adding more jobs increases IOPS up to the card's limit, but the per thread overhead remains about the same. Just a friendly analyze in a hope to improve the great tool. Multi-million IOPS storage is coming, so this is important. Or did I miss anything? > You may want to look at what fio settings your flash vendor recommends > for benchmarking purposes... Those where I started from. However, being a person with an experimental physics background, I started from very basics: calibrating my tools to figure out instrumental mistakes I have with them. My checks with fio led to this thread. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-02 4:25 ` Vladislav Bolkhovitin @ 2016-03-02 7:38 ` Sitsofe Wheeler 2016-03-03 3:02 ` Vladislav Bolkhovitin 2016-03-02 18:37 ` Elliott, Robert (Persistent Memory) 2016-03-03 3:03 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 22+ messages in thread From: Sitsofe Wheeler @ 2016-03-02 7:38 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org On 2 March 2016 at 04:25, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > Hi, > > Sitsofe Wheeler wrote on 02/29/2016 10:01 PM: >> Hi, >> >> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>> Hello, >>> >>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio >>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 >>> and direct=1, so user space CPU consumption should be virtually zero. >>> >>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio >>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I >>> hope, I'm using correct English terms). >>> >>> Can anybody comment it and suggest how to decrease this user space CPU consumption? >>> >>> Here is my full fio job: >>> >>> [global] >>> ioengine=libaio >>> buffered=0 >>> sync=0 >>> direct=1 >>> randrepeat=1 >>> softrandommap=1 >>> rw=randread >>> bs=4k >>> filename=./nvram (it's a link to a block device) >>> exitall=1 >>> thread=1 >>> disable_lat=1 >>> disable_slat=1 >>> disable_clat=1 >>> loops=10 >>> iodepth=16 >> >> You appear to be missing gtod_reduce >> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or >> gettimeofday cpu pinning. You also aren't using batching >> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). > > Thanks, I tried them, but they did not make any significant difference. The biggest There was no cpu reduction from setting both iodepth_batch and iodepth_batch_complete together? > difference I had was when I changed CPU governor to "performance". Now I have 20-25% > user space, measured by fio itself, it's coherent with top. Note, I'm considering > per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU). Have you applied the points mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Performance_Features_in_RednbspHat_EnterprisenbspLinuxnbsp7.html ? Things like the scheduler granularity (as changed by Red Hat's tuned-adm) can have a large impact... > The full job file was: > > [global] > ioengine=sync > buffered=0 > sync=0 > direct=1 [...] > iodepth=8 /* does not matter */ ^^^ It's worth noting that direct and iodepth don't really have an impact on synchronous ioengines - https://github.com/axboe/fio/blob/master/HOWTO#L804 . -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-02 7:38 ` Sitsofe Wheeler @ 2016-03-03 3:02 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-03 3:02 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org Sitsofe Wheeler wrote on 03/01/2016 11:38 PM: > On 2 March 2016 at 04:25, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >> Hi, >> >> Sitsofe Wheeler wrote on 02/29/2016 10:01 PM: >>> Hi, >>> >>> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>>> Hello, >>>> >>>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio >>>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 >>>> and direct=1, so user space CPU consumption should be virtually zero. >>>> >>>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio >>>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I >>>> hope, I'm using correct English terms). >>>> >>>> Can anybody comment it and suggest how to decrease this user space CPU consumption? >>>> >>>> Here is my full fio job: >>>> >>>> [global] >>>> ioengine=libaio >>>> buffered=0 >>>> sync=0 >>>> direct=1 >>>> randrepeat=1 >>>> softrandommap=1 >>>> rw=randread >>>> bs=4k >>>> filename=./nvram (it's a link to a block device) >>>> exitall=1 >>>> thread=1 >>>> disable_lat=1 >>>> disable_slat=1 >>>> disable_clat=1 >>>> loops=10 >>>> iodepth=16 >>> >>> You appear to be missing gtod_reduce >>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or >>> gettimeofday cpu pinning. You also aren't using batching >>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). >> >> Thanks, I tried them, but they did not make any significant difference. The biggest > > There was no cpu reduction from setting both iodepth_batch and > iodepth_batch_complete together? No, no significant difference. I also forgot to mention that I had tried gtod_cpu=11 as well, and it only made things worse. >> difference I had was when I changed CPU governor to "performance". Now I have 20-25% >> user space, measured by fio itself, it's coherent with top. Note, I'm considering >> per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU). > > Have you applied the points mentioned in > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Performance_Features_in_RednbspHat_EnterprisenbspLinuxnbsp7.html > ? Things like the scheduler granularity (as changed by Red Hat's > tuned-adm) can have a large impact... Thanks for your help and feedback, but you are trying to improve my results, while I'm asking how to _decrease fio overhead_ on high IOPS with libaio. It's very different question. >> The full job file was: >> >> [global] >> ioengine=sync >> buffered=0 >> sync=0 >> direct=1 > [...] >> iodepth=8 /* does not matter */ > ^^^ It's worth noting that direct and iodepth don't really have an > impact on synchronous ioengines - > https://github.com/axboe/fio/blob/master/HOWTO#L804 Yes, sure, this is why I commented it as "doesn't matter". Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* RE: Fio high IOPS measurement mistake 2016-03-02 4:25 ` Vladislav Bolkhovitin 2016-03-02 7:38 ` Sitsofe Wheeler @ 2016-03-02 18:37 ` Elliott, Robert (Persistent Memory) 2016-03-03 3:03 ` Vladislav Bolkhovitin 2016-03-03 3:03 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 22+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2016-03-02 18:37 UTC (permalink / raw) To: Vladislav Bolkhovitin, Sitsofe Wheeler, fio@vger.kernel.org > -----Original Message----- > From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On > Behalf Of Vladislav Bolkhovitin > Sent: Tuesday, March 01, 2016 10:25 PM > To: Sitsofe Wheeler <sitsofe@gmail.com>; fio@vger.kernel.org > Subject: Re: Fio high IOPS measurement mistake > > Hi, > ... > >> Here is my full fio job: > >> > >> [global] > >> ioengine=libaio > >> buffered=0 > >> sync=0 > >> direct=1 > >> randrepeat=1 > >> softrandommap=1 > >> rw=randread > >> bs=4k > >> filename=./nvram (it's a link to a block device) > >> exitall=1 > >> thread=1 > >> disable_lat=1 > >> disable_slat=1 > >> disable_clat=1 > >> loops=10 > >> iodepth=16 > > > > You appear to be missing gtod_reduce > > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or > > gettimeofday cpu pinning. You also aren't using batching > > (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). > ... > Jobs IOPS(M) %user %sys > 1 4.3 78 22 > 2 7.6 67 33 > 3 10.5 65 35 > 4 7.7 61 38 > 5 4.8 78 22 > 6 4.7 83 17 > 7 4.8 84 15 Use cpus_allowed_policy=split to keep threads or processes from wandering across CPUs. If the system is NUMA, split the NVDIMMs based on that and use cpus_allowed to just have threads access local NVDIMMs. As someone else mentioned, the random map overhead is noticeable, especially for large capacity devices. Use norandommap and randrepeat=0. There was a problem in 2014 requiring invalidate=0 for small capacities; I don't recall if that was fixed. If you're using the pmem driver, it is incapable of queuing (submissions are immediately completed), so iodepth is irrelevant. There are other system and OS settings to tune, like: * use the performance governor. The userspace governor spends a lot of time making acpi_os_write calls * turn off auditing (audit=0 on the kernel command line) * turn off C states ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-02 18:37 ` Elliott, Robert (Persistent Memory) @ 2016-03-03 3:03 ` Vladislav Bolkhovitin 2016-03-03 21:03 ` Elliott, Robert (Persistent Memory) 0 siblings, 1 reply; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-03 3:03 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Sitsofe Wheeler, fio@vger.kernel.org Elliott, Robert (Persistent Memory) wrote on 03/02/2016 10:37 AM: >>>> Here is my full fio job: >>>> >>>> [global] >>>> ioengine=libaio >>>> buffered=0 >>>> sync=0 >>>> direct=1 >>>> randrepeat=1 >>>> softrandommap=1 >>>> rw=randread >>>> bs=4k >>>> filename=./nvram (it's a link to a block device) >>>> exitall=1 >>>> thread=1 >>>> disable_lat=1 >>>> disable_slat=1 >>>> disable_clat=1 >>>> loops=10 >>>> iodepth=16 >>> >>> You appear to be missing gtod_reduce >>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or >>> gettimeofday cpu pinning. You also aren't using batching >>> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). >> > ... >> Jobs IOPS(M) %user %sys >> 1 4.3 78 22 >> 2 7.6 67 33 >> 3 10.5 65 35 >> 4 7.7 61 38 >> 5 4.8 78 22 >> 6 4.7 83 17 >> 7 4.8 84 15 > > Use cpus_allowed_policy=split to keep threads or processes > from wandering across CPUs. Tried, no help, which is easily explainable, because I'm always running num jobs < num CPUs. > If the system is NUMA, split the NVDIMMs based on that > and use cpus_allowed to just have threads access local > NVDIMMs. No, it's single socket. > As someone else mentioned, the random map overhead is > noticeable, especially for large capacity devices. Use > norandommap and randrepeat=0. Had done, see the job file in my message you didn't quote. > There was a problem in 2014 requiring invalidate=0 for > small capacities; I don't recall if that was fixed. > > If you're using the pmem driver, it is incapable of queuing > (submissions are immediately completed), so iodepth is > irrelevant. > > There are other system and OS settings to tune, like: > * use the performance governor. The userspace governor > spends a lot of time making acpi_os_write calls Had done, see not quoted part of my e-mail. > * turn off auditing (audit=0 on the kernel command line) > * turn off C states Tried too. Overall, I appreciate your help, but again, question is not how to improve my results. The question is how to _decrease fio overhead_ with libaio, see subject of this e-mail. It's very different question. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* RE: Fio high IOPS measurement mistake 2016-03-03 3:03 ` Vladislav Bolkhovitin @ 2016-03-03 21:03 ` Elliott, Robert (Persistent Memory) 2016-03-04 4:36 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 22+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2016-03-03 21:03 UTC (permalink / raw) To: Vladislav Bolkhovitin, Sitsofe Wheeler, fio@vger.kernel.org > -----Original Message----- > From: Vladislav Bolkhovitin [mailto:vst@vlnb.net] > Sent: Wednesday, March 2, 2016 9:03 PM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Sitsofe Wheeler > <sitsofe@gmail.com>; fio@vger.kernel.org > Subject: Re: Fio high IOPS measurement mistake > ... > > Overall, I appreciate your help, but again, question is not how to improve > my results. > The question is how to _decrease fio overhead_ with libaio, see subject of > this e-mail. > It's very different question. > > Thanks, > Vlad Here are some example results on one of my test systems with 4.4rc2, showing %usr around 19%. This job file: [global] direct=1 ioengine=libaio norandommap randrepeat=0 bs=4k iodepth=1 # irrelevant for pmem runtime=600 time_based=1 group_reporting thread gtod_reduce=1 # reduce=1 except for latency test zero_buffers cpus_allowed_policy=split numjobs=16 [drive_0] filename=/dev/pmem0 cpus_allowed=0-63 rw=randread [drive_1] filename=/dev/pmem1 cpus_allowed=0-63 rw=randread [drive_2] filename=/dev/pmem2 cpus_allowed=0-63 rw=randread [drive_3] filename=/dev/pmem3 cpus_allowed=0-63 rw=randread yields about 16M IOPS: read : io=9013.8GB, bw=63505MB/s, iops=16257K, runt=145344msec cpu : usr=19.04%, sys=80.86%, ctx=79415, majf=0, minf=4521 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=2362899826/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 with mpstat 1 reporting about 19% usr, 91% sys: 02:17:13 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle 02:17:14 PM all 19.11 0.00 80.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 02:17:15 PM all 19.19 0.00 80.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 02:17:16 PM all 19.27 0.00 80.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 02:17:17 PM all 19.26 0.00 80.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 With this test, the thread and zero_buffers options don't matter. The system has 4 NUMA nodes; restricting cpus_allowed to local CPUs for each pmem device raises that to 20M IOPS. read : io=7998.5GB, bw=78461MB/s, iops=20086K, runt=104388msec cpu : usr=19.55%, sys=56.98%, ctx=43481, majf=0, minf=3956 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=2096751180/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 perf top --dsos fio: 3.00% [.] get_io_u 2.22% [.] get_next_rand_offset 2.15% [.] thread_main 2.11% [.] io_u_queued_complete 1.64% [.] td_io_queue 1.44% [.] __get_io_u 1.40% [.] io_completed 1.17% [.] fio_libaio_commit 0.93% [.] fio_libaio_prep 0.84% [.] utime_since_now 0.74% [.] wait_for_completions 0.67% [.] fio_libaio_queue 0.60% [.] fio_libaio_getevents 0.54% [.] td_io_getevents perf top -g: + 67.45% 0.45% [kernel] [k] entry_SYSCALL_64_fastpath + 63.61% 0.68% libaio.so.1.0.1 [.] io_submit + 61.08% 0.10% [kernel] [k] sys_io_submit + 59.96% 1.55% [kernel] [k] do_io_submit + 52.82% 0.68% [kernel] [k] aio_run_iocb + 42.85% 0.36% [kernel] [k] blkdev_read_iter + 42.20% 0.88% [kernel] [k] generic_file_read_iter + 40.96% 0.49% [kernel] [k] blkdev_direct_IO + 40.20% 2.70% [kernel] [k] dax_do_io + 35.93% 35.93% [kernel] [k] copy_user_enhanced_fast_string + 6.09% 2.79% [kernel] [k] aio_complete + 5.55% 0.43% [kernel] [k] sys_io_getevents + 5.38% 0.00% [unknown] [.] 0x0684000241000684 + 4.09% 0.35% [kernel] [k] read_events + 3.01% 0.00% [unknown] [.] 0000000000000000 + 2.98% 0.62% [kernel] [k] rw_verify_area + 2.95% 2.93% fio [.] get_io_u + 2.67% 0.01% perf [.] hist_entry_iter__add + 2.42% 1.88% [kernel] [k] aio_read_events + 2.20% 0.36% [kernel] [k] security_file_permission + 2.13% 2.11% fio [.] thread_main + 2.09% 2.08% fio [.] get_next_rand_offset + 2.01% 1.99% fio [.] io_u_queued_complete + 1.96% 0.00% libaio.so.1.0.1 [.] 0xffff80df612af644 + 1.66% 1.66% [kernel] [k] lookup_ioctx + 1.51% 0.23% [kernel] [k] dax_map_atomic + 1.49% 1.49% [kernel] [k] entry_SYSCALL_64_after_swapgs + 1.49% 1.48% fio [.] td_io_queue + 1.46% 1.46% [kernel] [k] __fget + 1.39% 1.38% fio [.] io_completed + 1.36% 1.35% fio [.] __get_io_u + 1.34% 1.34% [kernel] [k] entry_SYSCALL_64 + 1.33% 0.08% [kernel] [k] fget + 1.14% 1.13% fio [.] fio_libaio_commit + 1.12% 0.99% [kernel] [k] selinux_file_permission + 1.03% 1.03% [kernel] [k] kmem_cache_alloc + 0.94% 0.54% [kernel] [k] bdev_direct_access + 0.91% 0.14% [kernel] [k] kiocb_free + 0.90% 0.89% fio [.] fio_libaio_prep + 0.88% 0.28% [kernel] [k] refill_reqs_available + 0.86% 0.85% fio [.] utime_since_now + 0.79% 0.79% [kernel] [k] get_reqs_available + 0.79% 0.79% [kernel] [k] kmem_cache_free ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 21:03 ` Elliott, Robert (Persistent Memory) @ 2016-03-04 4:36 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-04 4:36 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Sitsofe Wheeler, fio@vger.kernel.org Elliott, Robert (Persistent Memory) wrote on 03/03/2016 01:03 PM: >> -----Original Message----- >> From: Vladislav Bolkhovitin [mailto:vst@vlnb.net] >> Sent: Wednesday, March 2, 2016 9:03 PM >> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Sitsofe Wheeler >> <sitsofe@gmail.com>; fio@vger.kernel.org >> Subject: Re: Fio high IOPS measurement mistake >> > ... >> >> Overall, I appreciate your help, but again, question is not how to improve >> my results. >> The question is how to _decrease fio overhead_ with libaio, see subject of >> this e-mail. >> It's very different question. >> >> Thanks, >> Vlad > > Here are some example results on one of my test systems with 4.4rc2, > showing %usr around 19%. > > This job file: > [global] > direct=1 > ioengine=libaio > norandommap > randrepeat=0 > bs=4k > iodepth=1 # irrelevant for pmem > runtime=600 > time_based=1 > group_reporting > thread > gtod_reduce=1 # reduce=1 except for latency test > zero_buffers > cpus_allowed_policy=split > numjobs=16 > > [drive_0] > filename=/dev/pmem0 > cpus_allowed=0-63 > rw=randread > > [drive_1] > filename=/dev/pmem1 > cpus_allowed=0-63 > rw=randread > > [drive_2] > filename=/dev/pmem2 > cpus_allowed=0-63 > rw=randread > > [drive_3] > filename=/dev/pmem3 > cpus_allowed=0-63 > rw=randread > > yields about 16M IOPS: > read : io=9013.8GB, bw=63505MB/s, iops=16257K, runt=145344msec > cpu : usr=19.04%, sys=80.86%, ctx=79415, majf=0, minf=4521 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=2362899826/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > with mpstat 1 reporting about 19% usr, 91% sys: > 02:17:13 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle > 02:17:14 PM all 19.11 0.00 80.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 02:17:15 PM all 19.19 0.00 80.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 02:17:16 PM all 19.27 0.00 80.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > 02:17:17 PM all 19.26 0.00 80.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > > With this test, the thread and zero_buffers options don't matter. > > The system has 4 NUMA nodes; restricting cpus_allowed to local CPUs > for each pmem device raises that to 20M IOPS. > read : io=7998.5GB, bw=78461MB/s, iops=20086K, runt=104388msec > cpu : usr=19.55%, sys=56.98%, ctx=43481, majf=0, minf=3956 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=2096751180/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > perf top --dsos fio: > 3.00% [.] get_io_u > 2.22% [.] get_next_rand_offset > 2.15% [.] thread_main > 2.11% [.] io_u_queued_complete > 1.64% [.] td_io_queue > 1.44% [.] __get_io_u > 1.40% [.] io_completed > 1.17% [.] fio_libaio_commit > 0.93% [.] fio_libaio_prep > 0.84% [.] utime_since_now > 0.74% [.] wait_for_completions > 0.67% [.] fio_libaio_queue > 0.60% [.] fio_libaio_getevents > 0.54% [.] td_io_getevents > > perf top -g: > + 67.45% 0.45% [kernel] [k] entry_SYSCALL_64_fastpath > + 63.61% 0.68% libaio.so.1.0.1 [.] io_submit > + 61.08% 0.10% [kernel] [k] sys_io_submit > + 59.96% 1.55% [kernel] [k] do_io_submit > + 52.82% 0.68% [kernel] [k] aio_run_iocb > + 42.85% 0.36% [kernel] [k] blkdev_read_iter > + 42.20% 0.88% [kernel] [k] generic_file_read_iter > + 40.96% 0.49% [kernel] [k] blkdev_direct_IO > + 40.20% 2.70% [kernel] [k] dax_do_io > + 35.93% 35.93% [kernel] [k] copy_user_enhanced_fast_string > + 6.09% 2.79% [kernel] [k] aio_complete > + 5.55% 0.43% [kernel] [k] sys_io_getevents > + 5.38% 0.00% [unknown] [.] 0x0684000241000684 > + 4.09% 0.35% [kernel] [k] read_events > + 3.01% 0.00% [unknown] [.] 0000000000000000 > + 2.98% 0.62% [kernel] [k] rw_verify_area > + 2.95% 2.93% fio [.] get_io_u > + 2.67% 0.01% perf [.] hist_entry_iter__add > + 2.42% 1.88% [kernel] [k] aio_read_events > + 2.20% 0.36% [kernel] [k] security_file_permission > + 2.13% 2.11% fio [.] thread_main > + 2.09% 2.08% fio [.] get_next_rand_offset > + 2.01% 1.99% fio [.] io_u_queued_complete > + 1.96% 0.00% libaio.so.1.0.1 [.] 0xffff80df612af644 > + 1.66% 1.66% [kernel] [k] lookup_ioctx > + 1.51% 0.23% [kernel] [k] dax_map_atomic > + 1.49% 1.49% [kernel] [k] entry_SYSCALL_64_after_swapgs > + 1.49% 1.48% fio [.] td_io_queue > + 1.46% 1.46% [kernel] [k] __fget > + 1.39% 1.38% fio [.] io_completed > + 1.36% 1.35% fio [.] __get_io_u > + 1.34% 1.34% [kernel] [k] entry_SYSCALL_64 > + 1.33% 0.08% [kernel] [k] fget > + 1.14% 1.13% fio [.] fio_libaio_commit > + 1.12% 0.99% [kernel] [k] selinux_file_permission > + 1.03% 1.03% [kernel] [k] kmem_cache_alloc > + 0.94% 0.54% [kernel] [k] bdev_direct_access > + 0.91% 0.14% [kernel] [k] kiocb_free > + 0.90% 0.89% fio [.] fio_libaio_prep > + 0.88% 0.28% [kernel] [k] refill_reqs_available > + 0.86% 0.85% fio [.] utime_since_now > + 0.79% 0.79% [kernel] [k] get_reqs_available > + 0.79% 0.79% [kernel] [k] kmem_cache_free Thank you, you are proving my point and my concerns. Your per job IOPS (~1M) and user space consumption (20%) are similar to mine (640K and 25% correspondingly) and far from max IOPS possible (16M), so fio (or libaio?) overhead is seen in full in your test. Difference between your and my results might be explained that you are using the latest development, while I'm using SLES 12SP1, which is, as you can imagine, far behind of the latest development. Moreover, what is your PMEM? If it is a regular DDR4, depending from how many DIMMs you have (I guess, 4+ per NUMA node to populate all memory channels?) it should be capable of more or much more, than 16M IOPS overall and 1M per-thread (much more for sure), so it smells to me that the fio instrumental mistake plays significant role in your measurements, making your results significantly lower, than real HW and IO stack are really capable, hence you are pushing your fio tool in the range, where its accuracy drops significantly. I'd bet, if you take an ideal benchmarking tool with zero overhead, your results would be significantly higher. Actually, we are seeing this with our SCST (Linux SCSI target) tests, when with multiple initiators we sometimes have better performance, than with FIO locally. Until now I have never had time to look at it more closely. Looks like I have explanation now. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-02 4:25 ` Vladislav Bolkhovitin 2016-03-02 7:38 ` Sitsofe Wheeler 2016-03-02 18:37 ` Elliott, Robert (Persistent Memory) @ 2016-03-03 3:03 ` Vladislav Bolkhovitin 2016-03-03 7:10 ` Sitsofe Wheeler 2 siblings, 1 reply; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-03 3:03 UTC (permalink / raw) To: Sitsofe Wheeler, fio@vger.kernel.org Vladislav Bolkhovitin wrote on 03/01/2016 08:25 PM: > Hi, > > Sitsofe Wheeler wrote on 02/29/2016 10:01 PM: >> Hi, >> >> On 1 March 2016 at 05:17, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>> Hello, >>> >>> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio >>> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 >>> and direct=1, so user space CPU consumption should be virtually zero. >>> >>> That 30% user CPU consumption makes me suspect that this is overhead for internal fio >>> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I >>> hope, I'm using correct English terms). >>> >>> Can anybody comment it and suggest how to decrease this user space CPU consumption? >>> >>> Here is my full fio job: >>> >>> [global] >>> ioengine=libaio >>> buffered=0 >>> sync=0 >>> direct=1 >>> randrepeat=1 >>> softrandommap=1 >>> rw=randread >>> bs=4k >>> filename=./nvram (it's a link to a block device) >>> exitall=1 >>> thread=1 >>> disable_lat=1 >>> disable_slat=1 >>> disable_clat=1 >>> loops=10 >>> iodepth=16 >> >> You appear to be missing gtod_reduce >> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L1668 ) or >> gettimeofday cpu pinning. You also aren't using batching >> (https://github.com/axboe/fio/blob/fio-2.6/HOWTO#L815 ). > > Thanks, I tried them, but they did not make any significant difference. The biggest > difference I had was when I changed CPU governor to "performance". Now I have 20-25% > user space, measured by fio itself, it's coherent with top. Note, I'm considering > per-thread CPU consumption, to see it in top you need to press '1' (one line per each CPU). > > I also tried to short circuit the sync engine by calling fio_io_end() directly from top > of fio_syncio_queue(), so no actual IO is done. The results were interesting enough to > publish here in details (%% are per job): > > Jobs IOPS(M) %user %sys > 1 4.3 78 22 > 2 7.6 67 33 > 3 10.5 65 35 > 4 7.7 61 38 > 5 4.8 78 22 > 6 4.7 83 17 > 7 4.8 84 15 Update, since I realized that I forgot to short circuit lseek calls as well. After I done it in fio_syncio_prep(), the results significantly changed and became much more as expected: Jobs IOPS(M) %user %sys 1 6.9 100 0 2 13.6 100 0 3 19.5 100 0 4 25.1 100 0 5 32.5 100 0 6 38.1 100 0 7 45.5 100 0 8 50.8 100 0 So, apparently, the lseek calls were responsible for (1) such significant overhead (4.3M vs 6.9M single), (2) such poor scalability (10.5M vs 50.8) and (3) the significant 15-38% sys load in the previous tests. Such lseeks influence should not be so big, because each job supposed to have independent, dedicated file handle, correct? It together with 6.9M IOPS per job in the new result seem to me that very similar problem with libaio might be responsible for the overhead that made me start this thread (25% user space on 600K IOPS). Don't know how to check it though, because libaio engine is not so simple to short circuit. Perf is not helpful, because it produces essentially the same trace as below with no libaio mentioned anyhow. Oprofile also does not anyhow point out to libaio. For those who asked about perf profiling, it remained the same as before with the CPU consumption is all about timekeeping and memset: - 55.74% fio fio [.] clock_thread_fn clock_thread_fn - 7.11% fio [vdso] [.] 0x00000000000007c2 0x7ffc411657c2 __clock_gettime 0x7ffc411657c6 __clock_gettime 0x7ffc4116570d __clock_gettime 0x7ffc411657cb __clock_gettime 0x7ffc41165720 __clock_gettime 0x7ffc411657c0 __clock_gettime 0x7ffc411657cf __clock_gettime 0x7ffc411657c7 __clock_gettime 0x7ffc41165700 - 5.37% fio [kernel.kallsyms] [k] system_call system_call 0x7ffc411657c2 - 3.70% fio [kernel.kallsyms] [k] system_call_after_swapgs + system_call_after_swapgs + 0x7ffc411657c2 - 3.18% fio [kernel.kallsyms] [k] copy_user_enhanced_fast_string copy_user_enhanced_fast_string sys_clock_gettime system_call_fastpath 0x7ffc411657c2 0x7ffc411657c2 - 2.55% fio libc-2.19.so [.] memset memset 0x7320676e69737261 - 2.43% fio fio [.] calibrate_cpu_clock calibrate_cpu_clock + 2.31% fio [kernel.kallsyms] [k] getrawmonotonic + 1.85% fio [kernel.kallsyms] [k] page_fault + 1.62% fio [kernel.kallsyms] [k] native_read_tsc + 1.56% fio [kernel.kallsyms] [k] sys_clock_gettime + 1.53% fio libc-2.19.so [.] msort_with_tmp.part.0 + 0.96% fio [kernel.kallsyms] [k] clear_page_c_e + 0.92% fio [kernel.kallsyms] [k] sysret_check + 0.87% fio [kernel.kallsyms] [k] shmem_getpage_gfp + 0.58% fio [kernel.kallsyms] [k] read_tsc + 0.46% fio libc-2.19.so [.] __clock_gettime + 0.45% fio libc-2.19.so [.] __memcpy_sse2 With sequential reads, i.e. with no random numbers generation and tracking, the results were: Jobs IOPS(M) %user %sys 1 7.3 100 0 2 14.7 100 0 3 20.8 100 0 4 27.8 100 0 5 34.7 100 0 6 41.6 100 0 7 48.6 100 0 8 55.4 100 0 So, overhead of the random numbers generation and tracking with used jobs parameters is about 6-9% or 0.4M IOPS per job, which is quite OK, I'd say. Vlad > Results were very consistent between runs. CPU - 8 cores Intel Xeon E5-2667 v3 @ > 3.20GHz with 20M L3 cache and HT. Fio is the latest git. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 3:03 ` Vladislav Bolkhovitin @ 2016-03-03 7:10 ` Sitsofe Wheeler 2016-03-03 7:13 ` Sitsofe Wheeler ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Sitsofe Wheeler @ 2016-03-03 7:10 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > For those who asked about perf profiling, it remained the same as before with the CPU > consumption is all about timekeeping and memset: > > - 55.74% fio fio [.] clock_thread_fn > clock_thread_fn Perhaps this is what is already included above but could you use the -g option on perf to collect it into a call-graph and post the top results? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 7:10 ` Sitsofe Wheeler @ 2016-03-03 7:13 ` Sitsofe Wheeler 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-03 16:20 ` Jens Axboe 2016-03-04 4:37 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 22+ messages in thread From: Sitsofe Wheeler @ 2016-03-03 7:13 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org On 3 March 2016 at 07:10, Sitsofe Wheeler <sitsofe@gmail.com> wrote: > On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >> For those who asked about perf profiling, it remained the same as before with the CPU >> consumption is all about timekeeping and memset: >> >> - 55.74% fio fio [.] clock_thread_fn >> clock_thread_fn > > Perhaps this is what is already included above but could you use the > -g option on perf to collect it into a call-graph and post the top > results? One extra question: do you see a difference between when you use threads (thread=1) and when you use processes (thread=0)? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 7:13 ` Sitsofe Wheeler @ 2016-03-04 4:37 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-04 4:37 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org Sitsofe Wheeler wrote on 03/02/2016 11:13 PM: > On 3 March 2016 at 07:10, Sitsofe Wheeler <sitsofe@gmail.com> wrote: >> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>> For those who asked about perf profiling, it remained the same as before with the CPU >>> consumption is all about timekeeping and memset: >>> >>> - 55.74% fio fio [.] clock_thread_fn >>> clock_thread_fn >> >> Perhaps this is what is already included above but could you use the >> -g option on perf to collect it into a call-graph and post the top >> results? > > One extra question: do you see a difference between when you use > threads (thread=1) and when you use processes (thread=0)? Nothing serious (few %%), which could be easily explained, because I'm running number of jobs less, than CPU cores count, so each thread or process has a dedicated CPU core. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 7:10 ` Sitsofe Wheeler 2016-03-03 7:13 ` Sitsofe Wheeler @ 2016-03-03 16:20 ` Jens Axboe 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-04 4:37 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2016-03-03 16:20 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: Vladislav Bolkhovitin, fio@vger.kernel.org On Thu, Mar 03 2016, Sitsofe Wheeler wrote: > On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > > For those who asked about perf profiling, it remained the same as before with the CPU > > consumption is all about timekeeping and memset: > > > > - 55.74% fio fio [.] clock_thread_fn > > clock_thread_fn > > Perhaps this is what is already included above but could you use the > -g option on perf to collect it into a call-graph and post the top > results? The above looks like a side effect of using gtod_cpu, it'll burn one core. For the original poster - did you verify whether using gtod_cpu was faster than using the CPU clock source in each CPU? -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 16:20 ` Jens Axboe @ 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-04 15:33 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-04 4:37 UTC (permalink / raw) To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org Jens Axboe wrote on 03/03/2016 08:20 AM: > On Thu, Mar 03 2016, Sitsofe Wheeler wrote: >> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>> For those who asked about perf profiling, it remained the same as before with the CPU >>> consumption is all about timekeeping and memset: >>> >>> - 55.74% fio fio [.] clock_thread_fn >>> clock_thread_fn >> >> Perhaps this is what is already included above but could you use the >> -g option on perf to collect it into a call-graph and post the top >> results? > > The above looks like a side effect of using gtod_cpu, it'll burn one > core. For the original poster - did you verify whether using gtod_cpu > was faster than using the CPU clock source in each CPU? Yes, I had verified it and mentioned in one of my reports. It slightly decreased the IOPS. I guess, it's a locking contention somewhere. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-04 4:37 ` Vladislav Bolkhovitin @ 2016-03-04 15:33 ` Jens Axboe 2016-03-05 0:47 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2016-03-04 15:33 UTC (permalink / raw) To: Vladislav Bolkhovitin, Sitsofe Wheeler; +Cc: fio@vger.kernel.org On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote: > Jens Axboe wrote on 03/03/2016 08:20 AM: >> On Thu, Mar 03 2016, Sitsofe Wheeler wrote: >>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>>> For those who asked about perf profiling, it remained the same as before with the CPU >>>> consumption is all about timekeeping and memset: >>>> >>>> - 55.74% fio fio [.] clock_thread_fn >>>> clock_thread_fn >>> >>> Perhaps this is what is already included above but could you use the >>> -g option on perf to collect it into a call-graph and post the top >>> results? >> >> The above looks like a side effect of using gtod_cpu, it'll burn one >> core. For the original poster - did you verify whether using gtod_cpu >> was faster than using the CPU clock source in each CPU? > > Yes, I had verified it and mentioned in one of my reports. It slightly decreased the > IOPS. I guess, it's a locking contention somewhere. For clocksource=cpu there is no internal fio contention, nor can there be any kernel/OS contention. Getting the clock is serializing, so that might slow things down a bit. I've seen you bring up this contention idea before. Is that pure guesswork on your end, or have you profiled any contention? For most fio workloads there is no cross-job traffic, that's by design. Fio does add some overhead in general (everything does), and some of it is opt-out like cutting down on the number of time calls (gtod_cpu). But that's different from any locking contention between jobs, since it's constant and isn't affected by how you spread the workloads. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-04 15:33 ` Jens Axboe @ 2016-03-05 0:47 ` Vladislav Bolkhovitin 2016-03-05 0:54 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-05 0:47 UTC (permalink / raw) To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org Jens Axboe wrote on 03/04/2016 07:33 AM: > On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote: >> Jens Axboe wrote on 03/03/2016 08:20 AM: >>> On Thu, Mar 03 2016, Sitsofe Wheeler wrote: >>>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>>>> For those who asked about perf profiling, it remained the same as before with the CPU >>>>> consumption is all about timekeeping and memset: >>>>> >>>>> - 55.74% fio fio [.] clock_thread_fn >>>>> clock_thread_fn >>>> >>>> Perhaps this is what is already included above but could you use the >>>> -g option on perf to collect it into a call-graph and post the top >>>> results? >>> >>> The above looks like a side effect of using gtod_cpu, it'll burn one >>> core. For the original poster - did you verify whether using gtod_cpu >>> was faster than using the CPU clock source in each CPU? >> >> Yes, I had verified it and mentioned in one of my reports. It slightly decreased the >> IOPS. I guess, it's a locking contention somewhere. > > For clocksource=cpu there is no internal fio contention, nor can there be any kernel/OS > contention. Getting the clock is serializing, so that might slow things down a bit. Yes. Also, there might be a cache contention here, with one thread writing to a memory location and multiple threads reading from it. The same type of contention why queue spinlocks are faster, than ticket spinlocks. > I've seen you bring up this contention idea before. Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if you see performance dropping from certain number of threads, it is safe to guess there is a lock contention somewhere. > Is that pure guesswork on your end, or have you profiled any contention? Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much a black box for me. Generally, if you see performance drop with another thread, it must be either locks contention, or communication overhead. Nowadays the former is more common, hence the guess. Thanks, Vlad > For most fio workloads there is no cross-job > traffic, that's by design. Fio does add some overhead in general (everything does), and > some of it is opt-out like cutting down on the number of time calls (gtod_cpu). But > that's different from any locking contention between jobs, since it's constant and > isn't affected by how you spread the workloads. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-05 0:47 ` Vladislav Bolkhovitin @ 2016-03-05 0:54 ` Jens Axboe 2016-03-05 1:09 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2016-03-05 0:54 UTC (permalink / raw) To: Vladislav Bolkhovitin, Sitsofe Wheeler; +Cc: fio@vger.kernel.org On 03/04/2016 05:47 PM, Vladislav Bolkhovitin wrote: > Jens Axboe wrote on 03/04/2016 07:33 AM: >> On 03/03/2016 09:37 PM, Vladislav Bolkhovitin wrote: >>> Jens Axboe wrote on 03/03/2016 08:20 AM: >>>> On Thu, Mar 03 2016, Sitsofe Wheeler wrote: >>>>> On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >>>>>> For those who asked about perf profiling, it remained the same as before with the CPU >>>>>> consumption is all about timekeeping and memset: >>>>>> >>>>>> - 55.74% fio fio [.] clock_thread_fn >>>>>> clock_thread_fn >>>>> >>>>> Perhaps this is what is already included above but could you use the >>>>> -g option on perf to collect it into a call-graph and post the top >>>>> results? >>>> >>>> The above looks like a side effect of using gtod_cpu, it'll burn one >>>> core. For the original poster - did you verify whether using gtod_cpu >>>> was faster than using the CPU clock source in each CPU? >>> >>> Yes, I had verified it and mentioned in one of my reports. It slightly decreased the >>> IOPS. I guess, it's a locking contention somewhere. >> >> For clocksource=cpu there is no internal fio contention, nor can there be any kernel/OS >> contention. Getting the clock is serializing, so that might slow things down a bit. > > Yes. Also, there might be a cache contention here, with one thread writing to a memory > location and multiple threads reading from it. The same type of contention why queue > spinlocks are faster, than ticket spinlocks. But cacheline contention for the clock part would be bigger with gtod_cpu, since you have one CPU continually dirtying that cachline and the CPUs with jobs running on them reading it. With clocksource=cpu, that part would not share data. >> I've seen you bring up this contention idea before. > > Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if > you see performance dropping from certain number of threads, it is safe to guess there > is a lock contention somewhere. That is true, the lseek() part is not in fio though, that's a kernel issue. But for this case, avoiding lseek through one of the sync variants with offset is the best solution. Might actually makes sense to make that the default. >> Is that pure guesswork on your end, or have you profiled any contention? > > Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much > a black box for me. Generally, if you see performance drop with another thread, it must > be either locks contention, or communication overhead. Nowadays the former is more > common, hence the guess. Right, but that contention can be anywhere from the application to the driver. So some more details would be great, when you have them. One thing I've seen for higher IOPS cases is that the bdev inode inc/dec for every IO, that hurts scalability. That was fixed here: http://git.kernel.dk/cgit/linux-block/commit/?id=fe0f07d08ee35 but you might already be running a kernel with that included (what are you running?). Another one is the io stats collected, that tends to cause poorer scaling than you would expect, you can turn that off through sysfs (/sys/block/<dev>/queue/iostats). -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-05 0:54 ` Jens Axboe @ 2016-03-05 1:09 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-05 1:09 UTC (permalink / raw) To: Jens Axboe, Sitsofe Wheeler; +Cc: fio@vger.kernel.org Jens Axboe wrote on 03/04/2016 04:54 PM: >>> I've seen you bring up this contention idea before. >> >> Yes, it was when I forgot to short circuit lseek calls in the sync engine. Usually, if >> you see performance dropping from certain number of threads, it is safe to guess there >> is a lock contention somewhere. > > That is true, the lseek() part is not in fio though, that's a kernel issue. But for > this case, avoiding lseek through one of the sync variants with offset is the best > solution. Might actually makes sense to make that the default. BTW, as far as I can see, fio is using independent, fully dedicated file handles for each job. Shouldn't independent file handles, ever for the same block device, behave fully independently with nothing shared, hence protected by locks? In my naive understanding, lseek should be just changing internal offset, a single variable. Perhaps, translation fd (integer) -> corresponding internal structure is what to blame here (also pure guess). >>> Is that pure guesswork on your end, or have you profiled any contention? >> >> Pure guesswork. I'm looking at fio in details only few days, so it is still pretty much >> a black box for me. Generally, if you see performance drop with another thread, it must >> be either locks contention, or communication overhead. Nowadays the former is more >> common, hence the guess. > > Right, but that contention can be anywhere from the application to the driver. So some > more details would be great, when you have them. > > One thing I've seen for higher IOPS cases is that the bdev inode inc/dec for every IO, > that hurts scalability. That was fixed here: > > http://git.kernel.dk/cgit/linux-block/commit/?id=fe0f07d08ee35 > > but you might already be running a kernel with that included (what are you running?). > Another one is the io stats collected, that tends to cause poorer scaling than you > would expect, you can turn that off through sysfs (/sys/block/<dev>/queue/iostats). Thanks, I'm planning to update to the latest kernel and try with it. Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-03 7:10 ` Sitsofe Wheeler 2016-03-03 7:13 ` Sitsofe Wheeler 2016-03-03 16:20 ` Jens Axboe @ 2016-03-04 4:37 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-04 4:37 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org Sitsofe Wheeler wrote on 03/02/2016 11:10 PM: > On 3 March 2016 at 03:03, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >> For those who asked about perf profiling, it remained the same as before with the CPU >> consumption is all about timekeeping and memset: >> >> - 55.74% fio fio [.] clock_thread_fn >> clock_thread_fn > > Perhaps this is what is already included above but could you use the > -g option on perf to collect it into a call-graph and post the top > results? My trace was for -g perf option. All entries marked by '-' were expanded. Thanks, Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-01 5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin 2016-03-01 6:01 ` Sitsofe Wheeler @ 2016-03-02 8:26 ` Andrey Kuzmin 2016-03-03 3:02 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 22+ messages in thread From: Andrey Kuzmin @ 2016-03-02 8:26 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: fio@vger.kernel.org On Tue, Mar 1, 2016 at 8:17 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > Hello, > > I'm currently looking at one NVRAM device, and during fio tests noticed that each fio > thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 > and direct=1, so user space CPU consumption should be virtually zero. > > That 30% user CPU consumption makes me suspect that this is overhead for internal fio > housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I > hope, I'm using correct English terms). > If you believe fio to be 'mistaken', please profile your runs with perf and publish the profile, pointing out what you believe to be a mistake. > Can anybody comment it and suggest how to decrease this user space CPU consumption? > > Here is my full fio job: > > [global] > ioengine=libaio > buffered=0 > sync=0 > direct=1 > randrepeat=1 > softrandommap=1 I suggest you start with switching the random map off as it's known to be expensive and, for some reason, is being maintained even in read-only workloads. Regards, Andrey > rw=randread > bs=4k > filename=./nvram (it's a link to a block device) > exitall=1 > thread=1 > disable_lat=1 > disable_slat=1 > disable_clat=1 > loops=10 > iodepth=16 > > [file1] > > [file2] > > I'm working on million+ IOPS range. > > Thanks, > Vlad > -- > To unsubscribe from this list: send the line "unsubscribe fio" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Fio high IOPS measurement mistake 2016-03-02 8:26 ` Andrey Kuzmin @ 2016-03-03 3:02 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 22+ messages in thread From: Vladislav Bolkhovitin @ 2016-03-03 3:02 UTC (permalink / raw) To: Andrey Kuzmin; +Cc: fio@vger.kernel.org Andrey Kuzmin wrote on 03/02/2016 12:26 AM: > On Tue, Mar 1, 2016 at 8:17 AM, Vladislav Bolkhovitin <vst@vlnb.net> wrote: >> Hello, >> >> I'm currently looking at one NVRAM device, and during fio tests noticed that each fio >> thread consumes 30% of user space CPU. I'm using ioengine=libaio, buffered=0, sync=0 >> and direct=1, so user space CPU consumption should be virtually zero. >> >> That 30% user CPU consumption makes me suspect that this is overhead for internal fio >> housekeeping, i.e., scientifically speaking, fio instrumental measurement mistake (I >> hope, I'm using correct English terms). > > If you believe fio to be 'mistaken', please profile your runs with > perf and publish the profile, > pointing out what you believe to be a mistake. I had done it, see my yesterday's e-mail. >> Can anybody comment it and suggest how to decrease this user space CPU consumption? >> >> Here is my full fio job: >> >> [global] >> ioengine=libaio >> buffered=0 >> sync=0 >> direct=1 >> randrepeat=1 >> softrandommap=1 > > I suggest you start with switching the random map off as it's known to > be expensive and, > for some reason, is being maintained even in read-only workloads. Also done, see the same my e-mail. Thanks, Vlad > Regards, > Andrey > >> rw=randread >> bs=4k >> filename=./nvram (it's a link to a block device) >> exitall=1 >> thread=1 >> disable_lat=1 >> disable_slat=1 >> disable_clat=1 >> loops=10 >> iodepth=16 >> >> [file1] >> >> [file2] >> >> I'm working on million+ IOPS range. >> >> Thanks, >> Vlad ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-03-05 1:09 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-03-01 5:17 Fio high IOPS measurement mistake Vladislav Bolkhovitin 2016-03-01 6:01 ` Sitsofe Wheeler 2016-03-02 4:25 ` Vladislav Bolkhovitin 2016-03-02 7:38 ` Sitsofe Wheeler 2016-03-03 3:02 ` Vladislav Bolkhovitin 2016-03-02 18:37 ` Elliott, Robert (Persistent Memory) 2016-03-03 3:03 ` Vladislav Bolkhovitin 2016-03-03 21:03 ` Elliott, Robert (Persistent Memory) 2016-03-04 4:36 ` Vladislav Bolkhovitin 2016-03-03 3:03 ` Vladislav Bolkhovitin 2016-03-03 7:10 ` Sitsofe Wheeler 2016-03-03 7:13 ` Sitsofe Wheeler 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-03 16:20 ` Jens Axboe 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-04 15:33 ` Jens Axboe 2016-03-05 0:47 ` Vladislav Bolkhovitin 2016-03-05 0:54 ` Jens Axboe 2016-03-05 1:09 ` Vladislav Bolkhovitin 2016-03-04 4:37 ` Vladislav Bolkhovitin 2016-03-02 8:26 ` Andrey Kuzmin 2016-03-03 3:02 ` Vladislav Bolkhovitin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox