* RE: fio 3.2 [not found] ` <903d418b-bac4-104b-28e5-3c529efab7f5@kernel.dk> @ 2017-11-26 5:30 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-26 15:28 ` Sitsofe Wheeler 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-26 5:30 UTC (permalink / raw) To: fio@vger.kernel.org OK, I rebuild 3.2 with libaio & libpmem libraries, dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem libpmemblk-1.3-3.fc27.x86_64 libpmempool-devel-1.3-3.fc27.x86_64 libpmemlog-1.3-3.fc27.x86_64 libpmem-devel-1.3-3.fc27.x86_64 libpmemobj-devel-1.3-3.fc27.x86_64 libpmem-1.3-3.fc27.x86_64 libpmemobj-1.3-3.fc27.x86_64 libpmemblk-devel-1.3-3.fc27.x86_64 libpmempool-1.3-3.fc27.x86_64 libpmemlog-devel-1.3-3.fc27.x86_64 dl560g10spmem01:/var/work/fio-3.2 # ./configure Operating system Linux CPU x86_64 Big endian no Compiler gcc Cross compile no Static build no Wordsize 64 zlib yes Linux AIO support yes POSIX AIO support yes POSIX AIO support needs -lrt yes POSIX AIO fsync yes POSIX pshared support yes Solaris AIO support no __sync_fetch_and_add yes __sync_synchronize yes __sync_val_compare_and_swap yes libverbs no rdmacm no Linux fallocate yes POSIX fadvise yes POSIX fallocate yes sched_setaffinity(3 arg) yes sched_setaffinity(2 arg) no clock_gettime yes CLOCK_MONOTONIC yes CLOCK_MONOTONIC_RAW yes CLOCK_MONOTONIC_PRECISE no clockid_t yes gettimeofday yes fdatasync yes sync_file_range yes EXT4 move extent yes Linux splice(2) yes GUASI no Fusion-io atomic engine no libnuma yes libnuma v2 yes strsep yes strcasestr yes strlcat no getopt_long_only() yes inet_aton yes socklen_t yes __thread yes RUSAGE_THREAD yes SCHED_IDLE yes TCP_NODELAY yes Net engine window_size yes TCP_MAXSEG yes RLIMIT_MEMLOCK yes pwritev/preadv yes pwritev2/preadv2 no IPv6 helpers yes Rados Block Device engine no rbd blkin tracing no setvbuf yes Gluster API engine no s390_z196_facilities no HDFS engine no MTD yes libpmem yes libpmemblk yes NVML pmemblk engine yes NVML dev-dax engine yes lex/yacc for arithmetic no getmntent yes getmntinfo no Static Assert yes bool yes strndup yes march_armv8_a_crc_crypto no cuda no mkdir(a, b) yes dl560g10spmem01:/var/work/fio-3.2 # make CC crc/crc16.o CC crc/crc32.o CC crc/crc32c-arm64.o CC crc/crc32c-intel.o CC crc/crc32c.o CC crc/crc64.o CC crc/crc7.o CC crc/fnv.o CC crc/md5.o CC crc/murmur3.o CC crc/sha1.o CC crc/sha256.o CC crc/sha3.o CC crc/sha512.o CC crc/test.o CC crc/xxhash.o CC lib/axmap.o CC lib/bloom.o CC lib/flist_sort.o CC lib/gauss.o CC lib/getrusage.o CC lib/hweight.o CC lib/ieee754.o CC lib/lfsr.o CC lib/memalign.o CC lib/mountcheck.o CC lib/num2str.o CC lib/output_buffer.o CC lib/pattern.o CC lib/prio_tree.o CC lib/rand.o CC lib/rbtree.o CC lib/strntol.o CC lib/zipf.o CC gettime.o CC ioengines.o CC init.o CC stat.o CC log.o CC time.o CC filesetup.o CC eta.o CC verify.o CC memory.o CC io_u.o CC parse.o CC mutex.o CC options.o CC smalloc.o CC filehash.o CC profile.o CC debug.o CC engines/cpu.o CC engines/mmap.o CC engines/sync.o CC engines/null.o CC engines/net.o CC engines/ftruncate.o CC engines/filecreate.o CC server.o CC client.o CC iolog.o CC backend.o CC libfio.o CC flow.o CC cconv.o CC gettime-thread.o CC helpers.o CC json.o CC idletime.o CC td_error.o CC profiles/tiobench.o CC profiles/act.o CC io_u_queue.o CC filelock.o CC workqueue.o CC rate-submit.o CC optgroup.o CC helper_thread.o CC steadystate.o CC engines/libaio.o CC engines/posixaio.o CC engines/falloc.o CC engines/e4defrag.o CC engines/splice.o CC oslib/strlcat.o CC engines/mtd.o CC oslib/libmtd.o CC oslib/libmtd_legacy.o CC engines/pmemblk.o CC engines/dev-dax.o CC diskutil.o CC fifo.o CC blktrace.o CC cgroup.o CC trim.o CC engines/sg.o CC engines/binject.o CC oslib/linux-dev-lookup.o LINK fio LINK t/fio-genzipf LINK t/fio-btrace2fio LINK t/fio-dedupe LINK t/fio-verify-state LINK t/stest LINK t/ieee754 LINK t/axmap LINK t/lfsr-test LINK t/gen-rand dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --enghelp Available IO engines: cpuio mmap sync psync vsync pvsync pvsync2 null net netsplice ftruncate filecreate libaio posixaio falloc e4defrag splice mtd pmemblk dev-dax sg binject dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=dev-dax --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 fio-3.2 Starting 2 processes Run status group 0 (all jobs): dl560g10spmem01:/var/work/fio-3.2 # /dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota) dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=pmemblk --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1 fio-3.2 Starting 1 thread pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid argument) Run status group 0 (all jobs): dl560g10spmem01:/var/work/fio-3.2 # dl560g10spmem01:/var/work/fio-3.2 # ll /dev/dax0.0 -rw-r--r-- 1 root root 322122547200 Nov 25 22:16 /dev/dax0.0 dl560g10spmem01:/var/work/fio-3.2 # ll /dev/pmem1 brw-rw---- 1 root disk 259, 3 Nov 25 21:51 /dev/pmem1 No error messages in dmesg/messages. And the libpmem engine still not available. Anton -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Saturday, November 25, 2017 7:40 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> Subject: Re: fio 3.2 See the BUILDING section of the README. You are missing libaio-dev(el) and ditto for pmem. When you run configure, it shows you what it detects and what it doesn't. In the future, please send support questions to the reflector, as also described in the documentation, fio@vger.kernel.org. I can't cater to all personal support requests, and there are plenty of other folks on that list that can help you out, especially on basic questions like this. On 11/25/2017 08:50 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > In 3.2 version libaio engine also missed., can't be loaded > > � > > Anton > > � > > *From:* Gavriliuk, Anton (HPS Ukraine) > *Sent:* Saturday, November 25, 2017 9:12 AM > *To:* 'jens.axboe@oracle.com' <jens.axboe@oracle.com>; > 'jaxboe@fusionio.com' <jaxboe@fusionio.com>; 'axboe@fb.com' > <axboe@fb.com> > *Subject:* fio 3.2 > > � > > Hello Jens > > � > > I'm testing persistent memory using latest fio 3.2 > > � > > http://fio.readthedocs.io/en/latest/fio_doc.html#building > <https://urldefense.proofpoint.com/v2/url?u=http-3A__fio.readthedocs.i > o_en_latest_fio-5Fdoc.html-23building&d=DwMFAg&c=5VD0RTtNlTh3ycd41b3MU > w&r=cK1a7KivzZRh1fKQMjSm2A&m=gF-hXqu35gNKL7tvHVTPbXTfipn8kMan45lnr9z1C > kA&s=9wSt7cGFX26GOxsXLnQ6pALYuxas7yjYsYsV-AWIMhA&e=> > > � > > 1.12.10. I/O engine > > pmemblk > > Read and write using filesystem DAX to a file on a filesystem mounted with DAX on a persistent memory device through the NVML libpmemblk library. > > dev-dax > > Read and write using device DAX to a persistent memory device (e.g., /dev/dax0.0) through the NVML libpmem library. > > libpmem > > Read and write using mmap I/O to a file on a filesystem mounted with DAX on a persistent memory device through the NVML libpmem library. > > � > > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --version > > fio-3.2 > > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --enghelp > > Available IO engines: > > ������� cpuio > > ������� mmap > > ������� sync > > ������� psync > > ������� vsync > > ������� pvsync > > ������� pvsync2 > > ������� null > > ������� net > > ������� netsplice > > ����� ��ftruncate > > ������� filecreate > > ������� posixaio > > ������� falloc > > ������� e4defrag > > ������� splice > > ������� mtd > > ������� sg > > ������� binject > > dl560g10spmem01:/var/work/fio-3.2 # ll /dev|grep dax > > crw-------� 1 root root��� 243,�� 0 Nov 25 00:12 dax0.0 > > dl560g10spmem01:/var/work/fio-3.2 # ll /dev|grep pmem > > brw-rw----� 1 root disk��� 259,�� 2 Nov 25 00:35 pmem1 > > brw-rw----� 1 root disk��� 259,�� 0 Nov 25 00:12 pmem2 > > brw-rw----� 1 root disk��� 259,�� 1 Nov 25 00:12 pmem3 > > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio > --filename=/dev/dax0.0� --direct=1 --rw=randrw --refill_buffers > --norandommap --randrepeat=0 --ioengine=dev-dax > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=16 --runtime=180 --group_reporting > --name=4-rand-rw-3xx > > fio: engine dev-dax not loadable > > fio: failed to load engine > > dl560g10spmem01:/var/work/fio-3.2 # > > � > > I can't access to the 3 persistent memory engines above. > > � > > May you please help me with loading these engines. > > � > > � > > *Anton Gavriliuk* > > Critical Support > > Performance Engineering > -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-26 5:30 ` fio 3.2 Gavriliuk, Anton (HPS Ukraine) @ 2017-11-26 15:28 ` Sitsofe Wheeler 2017-11-27 12:39 ` Gavriliuk, Anton (HPS Ukraine) 0 siblings, 1 reply; 39+ messages in thread From: Sitsofe Wheeler @ 2017-11-26 15:28 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine); +Cc: fio@vger.kernel.org On 26 November 2017 at 05:30, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > OK, I rebuild 3.2 with libaio & libpmem libraries, > > dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem > libpmemblk-1.3-3.fc27.x86_64 > libpmempool-devel-1.3-3.fc27. [snip] > libpmem yes > libpmemblk yes > NVML pmemblk engine yes > NVML dev-dax engine yes OK so fio has been built with pmem support. [...] > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=dev-dax --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 > fio-3.2 > Starting 2 processes > > Run status group 0 (all jobs): Where's the output of this job? Did it work? > dl560g10spmem01:/var/work/fio-3.2 # > > /dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota) ^^^ So your dax filesystem appears at /pmem ... > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=pmemblk --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1 > fio-3.2 > Starting 1 thread > pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid argument) ...but you're using filename=/dev/pmem1 with pmemblk? At least the filename option you used looks incorrect for this ioengine - have you seen the comments and filenames in the example over on https://github.com/axboe/fio/blob/fio-3.2/examples/pmemblk.fio ? Also are you sure you want to be setting allow_mounted_write=1 in this case - what should be mounted where the I/O happens - won't pmemblk create the "device" itself? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-26 15:28 ` Sitsofe Wheeler @ 2017-11-27 12:39 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-27 19:38 ` Sitsofe Wheeler 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-27 12:39 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org I'm trying to test PMEM using DAX/mmap(), but there're no outputs & errors, dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll /dev|grep pmem brw-rw---- 1 root disk 259, 0 Nov 27 05:50 pmem1 brw-rw---- 1 root disk 259, 1 Nov 27 05:50 pmem2 brw-rw---- 1 root disk 259, 2 Nov 27 05:50 pmem3 dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll /dev|grep dax crw------- 1 root root 243, 0 Nov 27 05:50 dax0.0 dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll total 36 -rwxrwxrwx 1 root root 110 Nov 23 20:44 cleanupDAX.sh -rwxrwxrwx 1 root root 381 Nov 27 05:55 run-fio-db-64krw.sh -rwxr-xr-x 1 root root 361 Nov 23 20:51 run-fio-db-64ksw-numaopt.sh -rwxrwxrwx 1 root root 361 Nov 23 20:47 run-fio-db-64ksw.sh -rwxrwxrwx 1 root root 375 Nov 23 20:49 run-fio-db-8krr.sh -rwxrwxrwx 1 root root 636 Nov 23 20:49 run-fio-db-8krrw.sh -rwxrwxrwx 1 root root 339 Nov 23 20:45 run-itDAX.sh -rwxrwxrwx 1 root root 132 Nov 23 20:46 setupDAX.sh -rw-r--r-- 1 root root 117 Nov 23 20:16 testDAX.out dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # cat run-fio-db-64krw.sh #! /bin/sh echo "Running random write 64k 32 IO 1024m file 8 threads on DAX PMEM" result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep WRITE) agg=$(echo $result|cut -d',' -f2) speed="${agg/aggrb=/Write Speed:}" echo $speed echo "" dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64krw.sh Running random write 64k 32 IO 1024m file 8 threads on DAX PMEM dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64ksw-numaopt.sh Running seq write 64k 1 IO 3g file 4 threads on DAX PMEM dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64ksw.sh Running seq write 64k 1 IO 3g file 4 threads on DAX PMEM dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-8krr.sh Running random read 8k 32 IO 1024m file 8 threads on DAX PMEM dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-8krrw.sh Running random read/write 60/40 8k 32 IO 1024m file 8 threads on DAX PMEM dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # -----Original Message----- From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] Sent: Sunday, November 26, 2017 5:28 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> Cc: fio@vger.kernel.org Subject: Re: fio 3.2 On 26 November 2017 at 05:30, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > OK, I rebuild 3.2 with libaio & libpmem libraries, > > dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem > libpmemblk-1.3-3.fc27.x86_64 > libpmempool-devel-1.3-3.fc27. [snip] > libpmem yes > libpmemblk yes > NVML pmemblk engine yes > NVML dev-dax engine yes OK so fio has been built with pmem support. [...] > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio > --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap > --randrepeat=0 --ioengine=dev-dax > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1 > fio-3.2 > Starting 2 processes > > Run status group 0 (all jobs): Where's the output of this job? Did it work? > dl560g10spmem01:/var/work/fio-3.2 # > > /dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota) ^^^ So your dax filesystem appears at /pmem ... > dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio > --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap > --randrepeat=0 --ioengine=pmemblk > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1 > fio-3.2 > Starting 1 thread > pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid > argument) ...but you're using filename=/dev/pmem1 with pmemblk? At least the filename option you used looks incorrect for this ioengine - have you seen the comments and filenames in the example over on https://github.com/axboe/fio/blob/fio-3.2/examples/pmemblk.fio ? Also are you sure you want to be setting allow_mounted_write=1 in this case - what should be mounted where the I/O happens - won't pmemblk create the "device" itself? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-27 12:39 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-27 19:38 ` Sitsofe Wheeler 2017-11-27 22:48 ` Rebecca Cran ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Sitsofe Wheeler @ 2017-11-27 19:38 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine) Cc: fio@vger.kernel.org, Brian Boylston, Robert Elliott Unfortunately I don't have access to a pmem device but let's see how far we get: On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > > result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep WRITE) Please make your problem scenarios as simple as possible: 1. Just run fio normally so we can see the output it produces on both stdout and stderr normally. 2. Reduce the job back so it there are only the bare minimum options that reproduce the problem. 3. Try to avoid changing lots of things. Here you've switched ioengine introducing another place to look. Instead how about this: Was your /dev/dax0.0 device made using ndctl? Assuming yes: fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g (apparently a size has to be specified when you try to use a character device - see https://nvdimm.wiki.kernel.org/ ) If you run just that by itself what do you see? Next up: fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g If you run just that by itself what do you see? Finally: Assuming a -o dax mounted filesystem on /pmem0/ : fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/pmem0/fio-test,4096,1024 --thread=1 If you run just that by itself what do you see? (Perhaps the documentation for these ioengines and pmem devices needs to be improved?) -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-27 19:38 ` Sitsofe Wheeler @ 2017-11-27 22:48 ` Rebecca Cran 2017-11-28 1:12 ` Gavriliuk, Anton (HPS Ukraine) [not found] ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org> 2017-11-28 14:25 ` Gavriliuk, Anton (HPS Ukraine) 2 siblings, 1 reply; 39+ messages in thread From: Rebecca Cran @ 2017-11-27 22:48 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Gavriliuk, Anton (HPS Ukraine), fio@vger.kernel.org, Brian Boylston, Robert Elliott If you don’t have a pmem device, Linux can emulate one using RAM and the ‘memmap’ kernel parameter. See http://pmem.io/2016/02/22/pm-emulation.html for more info. Rebecca Sent from my iPhone On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com <mailto:sitsofe@gmail.com>> wrote: > Unfortunately I don't have access to a pmem device but let's see how > far we get: > > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote: >> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep >> WRITE) > > Please make your problem scenarios as simple as possible: > 1. Just run fio normally so we can see the output it produces on both > stdout and stderr normally. > 2. Reduce the job back so it there are only the bare minimum options > that reproduce the problem. > 3. Try to avoid changing lots of things. > > Here you've switched ioengine introducing another place to look. > Instead how about this: > > Was your /dev/dax0.0 device made using ndctl? Assuming yes: > > fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never > --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g > > (apparently a size has to be specified when you try to use a character > device - see https://nvdimm.wiki.kernel.org/ ) > > If you run just that by itself what do you see? > > Next up: > > fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never > --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g > > If you run just that by itself what do you see? > > Finally: > Assuming a -o dax mounted filesystem on /pmem0/ : > > fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k > --eta=never --time_based --runtime=60 > --filename=/pmem0/fio-test,4096,1024 --thread=1 > > If you run just that by itself what do you see? > > (Perhaps the documentation for these ioengines and pmem devices needs > to be improved?) > > -- > Sitsofe | http://sucs.org/~sits/ <http://sucs.org/%7Esits/> > -- > To unsubscribe from this list: send the line "unsubscribe fio" in > the body of a message to majordomo@vger.kernel.org > <mailto:majordomo@vger.kernel.org> > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-27 22:48 ` Rebecca Cran @ 2017-11-28 1:12 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-28 4:51 ` Elliott, Robert (Persistent Memory) 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-28 1:12 UTC (permalink / raw) To: Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Brian Boylston, Elliott, Robert (Persistent Memory) No, I have true 4 pmem devices right now, 300Gb each, dl560g10spmem01:~ # ll /dev|grep dax crw------- 1 root root 243, 0 Nov 27 05:50 dax0.0 dl560g10spmem01:~ # ll /dev|grep pmem brw-rw---- 1 root disk 259, 0 Nov 27 11:38 pmem1 brw-rw---- 1 root disk 259, 1 Nov 27 12:39 pmem2 brw-rw---- 1 root disk 259, 2 Nov 27 05:50 pmem3 Anton -----Original Message----- From: Rebecca Cran [mailto:rebecca@bluestop.org] Sent: Tuesday, November 28, 2017 12:48 AM To: Sitsofe Wheeler <sitsofe@gmail.com> Cc: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; fio@vger.kernel.org; Brian Boylston <brian.boylston@hpe.com>; Elliott, Robert (Persistent Memory) <elliott@hpe.com> Subject: Re: fio 3.2 If you don’t have a pmem device, Linux can emulate one using RAM and the ‘memmap’ kernel parameter. See http://pmem.io/2016/02/22/pm-emulation.html for more info. Rebecca Sent from my iPhone On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com <mailto:sitsofe@gmail.com>> wrote: > Unfortunately I don't have access to a pmem device but let's see how > far we get: > > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote: >> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep >> WRITE) > > Please make your problem scenarios as simple as possible: > 1. Just run fio normally so we can see the output it produces on both > stdout and stderr normally. > 2. Reduce the job back so it there are only the bare minimum options > that reproduce the problem. > 3. Try to avoid changing lots of things. > > Here you've switched ioengine introducing another place to look. > Instead how about this: > > Was your /dev/dax0.0 device made using ndctl? Assuming yes: > > fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never > --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g > > (apparently a size has to be specified when you try to use a character > device - see https://nvdimm.wiki.kernel.org/ ) > > If you run just that by itself what do you see? > > Next up: > > fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never > --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g > > If you run just that by itself what do you see? > > Finally: > Assuming a -o dax mounted filesystem on /pmem0/ : > > fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k > --eta=never --time_based --runtime=60 > --filename=/pmem0/fio-test,4096,1024 --thread=1 > > If you run just that by itself what do you see? > > (Perhaps the documentation for these ioengines and pmem devices needs > to be improved?) > > -- > Sitsofe | http://sucs.org/~sits/ <http://sucs.org/%7Esits/> > -- > To unsubscribe from this list: send the line "unsubscribe fio" in the > body of a message to majordomo@vger.kernel.org > <mailto:majordomo@vger.kernel.org> > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-28 1:12 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-28 4:51 ` Elliott, Robert (Persistent Memory) 2017-11-29 3:35 ` Gavriliuk, Anton (HPS Ukraine) 0 siblings, 1 reply; 39+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2017-11-28 4:51 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Monday, November 27, 2017 7:13 PM > Subject: RE: fio 3.2 > > No, I have true 4 pmem devices right now, 300Gb each, I think that suggestion was for Sitsofe. > On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com > <mailto:sitsofe@gmail.com>> wrote: > > > Unfortunately I don't have access to a pmem device but let's see how > > far we get: > > > > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) > > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote: > >> > >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 > >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 > >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep > >> WRITE) ... > > Here you've switched ioengine introducing another place to look. > > Instead how about this: > > ... > > (apparently a size has to be specified when you try to use a character > > device - see https://nvdimm.wiki.kernel.org/ ) > > ... > > (Perhaps the documentation for these ioengines and pmem devices needs > > to be improved?) There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files). 1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that. 2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process): #!/bin/bash for i in {0..15} do echo working on $i # -M map puts struct page data in regular memory, not nvdimm memory # that's fine for NVDIMM-Ns; intended for larger capacities ndctl create-namespace -m dax -M mem -e namespace$i.0 -f done ndctl carves out some space for device DAX metadata. For 8 GiB of persistent memory, the resulting character device sizes are: -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB) -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB) Since /dev/dax is a character device, it has no "size", so you must manually tell it the size. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in linux. I don't know how this would work in Windows. 3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware. 4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb. norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates. I don't think iodepth is used by the mmap ioengine. 5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory. Example: 36 threads generating 64 KiB reads from /dev/dax0.0 ends up reading read 11 GB/s from persistent memory; the writes stay in CPU cache (no memory write traffic). |---------------------------------------||---------------------------------------| |-- Socket 0 --||-- Socket 1 --| |---------------------------------------||---------------------------------------| |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| |---------------------------------------||---------------------------------------| |-- Mem Ch 0: Reads (MB/s): 11022.33 --||-- Mem Ch 0: Reads (MB/s): 15.60 --| |-- Writes(MB/s): 22.29 --||-- Writes(MB/s): 7.09 --| |-- Mem Ch 1: Reads (MB/s): 56.72 --||-- Mem Ch 1: Reads (MB/s): 9.34 --| |-- Writes(MB/s): 17.03 --||-- Writes(MB/s): 0.81 --| |-- Mem Ch 4: Reads (MB/s): 60.12 --||-- Mem Ch 4: Reads (MB/s): 15.79 --| |-- Writes(MB/s): 23.66 --||-- Writes(MB/s): 7.10 --| |-- Mem Ch 5: Reads (MB/s): 54.94 --||-- Mem Ch 5: Reads (MB/s): 9.65 --| |-- Writes(MB/s): 17.30 --||-- Writes(MB/s): 1.02 --| |-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) : 50.37 --| |-- NODE 0 Mem Write(MB/s) : 80.28 --||-- NODE 1 Mem Write(MB/s) : 16.02 --| |-- NODE 0 P. Write (T/s): 192832 --||-- NODE 1 P. Write (T/s): 190615 --| |-- NODE 0 Memory (MB/s): 11274.38 --||-- NODE 1 Memory (MB/s): 66.39 --| |---------------------------------------||---------------------------------------| |---------------------------------------||---------------------------------------| |-- System Read Throughput(MB/s): 11332.59 --| |-- System Write Throughput(MB/s): 96.23 --| |-- System Memory Throughput(MB/s): 11428.82 --| |---------------------------------------||---------------------------------------| Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, most of those turn into writes to regular memory (with some reads as well as the caches thrash). |---------------------------------------||---------------------------------------| |-- Socket 0 --||-- Socket 1 --| |---------------------------------------||---------------------------------------| |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| |---------------------------------------||---------------------------------------| |-- Mem Ch 0: Reads (MB/s): 9069.96 --||-- Mem Ch 0: Reads (MB/s): 18.33 --| |-- Writes(MB/s): 2026.71 --||-- Writes(MB/s): 8.44 --| |-- Mem Ch 1: Reads (MB/s): 1070.84 --||-- Mem Ch 1: Reads (MB/s): 11.08 --| |-- Writes(MB/s): 2021.61 --||-- Writes(MB/s): 2.11 --| |-- Mem Ch 4: Reads (MB/s): 1069.20 --||-- Mem Ch 4: Reads (MB/s): 17.85 --| |-- Writes(MB/s): 2028.27 --||-- Writes(MB/s): 8.51 --| |-- Mem Ch 5: Reads (MB/s): 1062.81 --||-- Mem Ch 5: Reads (MB/s): 11.59 --| |-- Writes(MB/s): 2021.74 --||-- Writes(MB/s): 2.31 --| |-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) : 58.85 --| |-- NODE 0 Mem Write(MB/s) : 8098.33 --||-- NODE 1 Mem Write(MB/s) : 21.36 --| |-- NODE 0 P. Write (T/s): 220528 --||-- NODE 1 P. Write (T/s): 190643 --| |-- NODE 0 Memory (MB/s): 20371.14 --||-- NODE 1 Memory (MB/s): 80.21 --| |---------------------------------------||---------------------------------------| |---------------------------------------||---------------------------------------| |-- System Read Throughput(MB/s): 12331.67 --| |-- System Write Throughput(MB/s): 8119.69 --| |-- System Memory Throughput(MB/s): 20451.36 --| |---------------------------------------||---------------------------------------| Example simple script: [random-test] kb_base=1000 ioengine=mmap iodepth=1 rw=randread bs=64KiB size=7GiB numjobs=36 cpus_allowed_policy=split cpus_allowed=0-17,36-53 group_reporting=1 time_based runtime=60 allow_file_create=0 filename=/dev/dax0.0 Example more aggressive script: [global] kb_base=1000 ioengine=mmap iodepth=1 rw=randread bs=64KiB ba=64KiB numjobs=36 cpus_allowed_policy=split cpus_allowed=0-17,36-53 group_reporting=1 norandommap randrepeat=0 zero_buffers gtod_reduce=1 time_based runtime=60999 allow_file_create=0 [d0] size=7GiB filename=/dev/dax0.0 [d1] size=7GiB filename=/dev/dax1.0 [d2] size=7GiB filename=/dev/dax2.0 [d3] size=7GiB filename=/dev/dax3.0 [d4] size=7GiB filename=/dev/dax4.0 [d5] size=7GiB filename=/dev/dax5.0 [d6] size=7GiB filename=/dev/dax6.0 [d7] size=7GiB filename=/dev/dax7.0 ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-28 4:51 ` Elliott, Robert (Persistent Memory) @ 2017-11-29 3:35 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 3:35 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Sorry Elliott, seems to be I missed your mail. 1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that. Done, no visible improvements detected, dl560g10spmem01:~ # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1 ... fio-2.12 Starting 16 processes Jobs: 14 (f=14): [m(10),_(1),m(4),_(1)] [78.1% done] [1569MB/29903MB/0KB /s] [30.1K/588K/0 iops] [eta 00m:34s] 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=18988: Tue Nov 28 21:30:12 2017 read : io=204940MB, bw=1707.9MB/s, iops=33575, runt=120001msec clat (usec): min=0, max=1008, avg= 8.85, stdev= 5.25 lat (usec): min=0, max=1008, avg= 8.89, stdev= 5.25 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 2], 20.00th=[ 4], | 30.00th=[ 6], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 11], | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14], | 99.00th=[ 29], 99.50th=[ 43], 99.90th=[ 48], 99.95th=[ 49], | 99.99th=[ 53] bw (KB /s): min=68034, max=158528, per=6.25%, avg=109348.46, stdev=19939.95 write: io=3798.1GB, bw=32417MB/s, iops=637449, runt=120001msec clat (usec): min=0, max=616, avg=15.13, stdev=10.08 lat (usec): min=0, max=717, avg=15.17, stdev=10.08 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 3], 20.00th=[ 6], | 30.00th=[ 11], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 20], | 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 23], 95.00th=[ 24], | 99.00th=[ 46], 99.50th=[ 86], 99.90th=[ 91], 99.95th=[ 92], | 99.99th=[ 96] bw (MB /s): min= 1359, max= 2700, per=6.25%, avg=2026.98, stdev=363.84 lat (usec) : 2=4.70%, 4=8.45%, 10=11.33%, 20=32.66%, 50=42.21% lat (usec) : 100=0.65%, 250=0.01%, 500=0.01%, 750=0.01% lat (msec) : 2=0.01% cpu : usr=99.74%, sys=0.21%, ctx=2538, majf=0, minf=2378756 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=4029100/w=76494589/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: io=204940MB, aggrb=1707.9MB/s, minb=1707.9MB/s, maxb=1707.9MB/s, mint=120001msec, maxt=120001msec WRITE: io=3798.1GB, aggrb=32417MB/s, minb=32417MB/s, maxb=32417MB/s, mint=120001msec, maxt=120001msec HARDCLOCK entries Count Pct State Function 1955 64.97% USER __memcpy_avx_unaligned [/lib64/libc-2.22.so] 1037 34.46% USER UNKNOWN 6 0.20% SYS do_page_fault 5 0.17% SYS find_next_iomem_res 3 0.10% SYS pagerange_is_ram_callback 1 0.03% SYS page_add_new_anon_rmap 1 0.03% SYS lookup_memtype 1 0.03% SYS vmf_insert_pfn_pmd Count Pct HARDCLOCK Stack trace ============================================================ 4 0.13% do_page_fault page_fault unknown | __memcpy_avx_unaligned 3 0.10% find_next_iomem_res pagerange_is_ram_callback walk_system_ram_range pat_pagerange_is_ram lookup_memtype track_pfn_insert vmf_insert_pfn_pmd dax_dev_pmd_fault handle_mm_fault __do_page_fault do_page_fault page_fault unknown | __memcpy_avx_unaligned 2 0.07% pagerange_is_ram_callback walk_system_ram_range pat_pagerange_is_ram lookup_memtype track_pfn_insert vmf_insert_pfn_pmd dax_dev_pmd_fault handle_mm_fault __do_page_fault do_page_fault page_fault unknown | __memcpy_avx_unaligned 3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware. Yes, fio maps entire device by 1 mmap(), ******** SYSTEM CALL REPORT ******** System Call Name Count Rate ElpTime Avg Max Errs AvSz KB/s futex 7 0.2 0.101176 0.014454 0.101147 1 SLEEP 2 0.0 0.101148 0.050574 Sleep Func 2 0.101148 0.050574 0.101140 futex_wait_queue_me RUNQ 0.000004 CPU 0.000024 open 2 0.0 0.000132 0.000066 0.000130 0 read 1 0.0 0.000009 0.000009 0.000009 0 3 0.0 shmat 1 0.0 0.000006 0.000006 0.000006 0 mmap 1 0.0 0.000003 0.000003 0.000003 0 4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb. The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't understand what does it mean - " random reads seem to be done with rep movsb " -----Original Message----- From: Elliott, Robert (Persistent Memory) Sent: Tuesday, November 28, 2017 6:52 AM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Monday, November 27, 2017 7:13 PM > Subject: RE: fio 3.2 > > No, I have true 4 pmem devices right now, 300Gb each, I think that suggestion was for Sitsofe. > On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com > <mailto:sitsofe@gmail.com>> wrote: > > > Unfortunately I don't have access to a pmem device but let's see how > > far we get: > > > > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) > > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote: > >> > >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 > >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 > >> --group_reporting=1 --eta=never --time_based --runtime=60 > >> --filename=/dev/dax0.0 | grep > >> WRITE) ... > > Here you've switched ioengine introducing another place to look. > > Instead how about this: > > ... > > (apparently a size has to be specified when you try to use a > > character device - see https://nvdimm.wiki.kernel.org/ ) > > ... > > (Perhaps the documentation for these ioengines and pmem devices > > needs to be improved?) There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files). 1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that. 2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process): #!/bin/bash for i in {0..15} do echo working on $i # -M map puts struct page data in regular memory, not nvdimm memory # that's fine for NVDIMM-Ns; intended for larger capacities ndctl create-namespace -m dax -M mem -e namespace$i.0 -f done ndctl carves out some space for device DAX metadata. For 8 GiB of persistent memory, the resulting character device sizes are: -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB) -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB) Since /dev/dax is a character device, it has no "size", so you must manually tell it the size. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in linux. I don't know how this would work in Windows. 3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware. 4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb. norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates. I don't think iodepth is used by the mmap ioengine. 5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory. Example: 36 threads generating 64 KiB reads from /dev/dax0.0 ends up reading read 11 GB/s from persistent memory; the writes stay in CPU cache (no memory write traffic). |---------------------------------------||---------------------------------------| |-- Socket 0 --||-- Socket 1 --| |---------------------------------------||---------------------------------------| |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| |---------------------------------------||---------------------------------------| |-- Mem Ch 0: Reads (MB/s): 11022.33 --||-- Mem Ch 0: Reads (MB/s): 15.60 --| |-- Writes(MB/s): 22.29 --||-- Writes(MB/s): 7.09 --| |-- Mem Ch 1: Reads (MB/s): 56.72 --||-- Mem Ch 1: Reads (MB/s): 9.34 --| |-- Writes(MB/s): 17.03 --||-- Writes(MB/s): 0.81 --| |-- Mem Ch 4: Reads (MB/s): 60.12 --||-- Mem Ch 4: Reads (MB/s): 15.79 --| |-- Writes(MB/s): 23.66 --||-- Writes(MB/s): 7.10 --| |-- Mem Ch 5: Reads (MB/s): 54.94 --||-- Mem Ch 5: Reads (MB/s): 9.65 --| |-- Writes(MB/s): 17.30 --||-- Writes(MB/s): 1.02 --| |-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) : 50.37 --| |-- NODE 0 Mem Write(MB/s) : 80.28 --||-- NODE 1 Mem Write(MB/s) : 16.02 --| |-- NODE 0 P. Write (T/s): 192832 --||-- NODE 1 P. Write (T/s): 190615 --| |-- NODE 0 Memory (MB/s): 11274.38 --||-- NODE 1 Memory (MB/s): 66.39 --| |---------------------------------------||---------------------------------------| |---------------------------------------||---------------------------------------| |-- System Read Throughput(MB/s): 11332.59 --| |-- System Write Throughput(MB/s): 96.23 --| |-- System Memory Throughput(MB/s): 11428.82 --| |---------------------------------------||---------------------------------------| Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, most of those turn into writes to regular memory (with some reads as well as the caches thrash). |---------------------------------------||---------------------------------------| |-- Socket 0 --||-- Socket 1 --| |---------------------------------------||---------------------------------------| |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| |---------------------------------------||---------------------------------------| |-- Mem Ch 0: Reads (MB/s): 9069.96 --||-- Mem Ch 0: Reads (MB/s): 18.33 --| |-- Writes(MB/s): 2026.71 --||-- Writes(MB/s): 8.44 --| |-- Mem Ch 1: Reads (MB/s): 1070.84 --||-- Mem Ch 1: Reads (MB/s): 11.08 --| |-- Writes(MB/s): 2021.61 --||-- Writes(MB/s): 2.11 --| |-- Mem Ch 4: Reads (MB/s): 1069.20 --||-- Mem Ch 4: Reads (MB/s): 17.85 --| |-- Writes(MB/s): 2028.27 --||-- Writes(MB/s): 8.51 --| |-- Mem Ch 5: Reads (MB/s): 1062.81 --||-- Mem Ch 5: Reads (MB/s): 11.59 --| |-- Writes(MB/s): 2021.74 --||-- Writes(MB/s): 2.31 --| |-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) : 58.85 --| |-- NODE 0 Mem Write(MB/s) : 8098.33 --||-- NODE 1 Mem Write(MB/s) : 21.36 --| |-- NODE 0 P. Write (T/s): 220528 --||-- NODE 1 P. Write (T/s): 190643 --| |-- NODE 0 Memory (MB/s): 20371.14 --||-- NODE 1 Memory (MB/s): 80.21 --| |---------------------------------------||---------------------------------------| |---------------------------------------||---------------------------------------| |-- System Read Throughput(MB/s): 12331.67 --| |-- System Write Throughput(MB/s): 8119.69 --| |-- System Memory Throughput(MB/s): 20451.36 --| |---------------------------------------||---------------------------------------| Example simple script: [random-test] kb_base=1000 ioengine=mmap iodepth=1 rw=randread bs=64KiB size=7GiB numjobs=36 cpus_allowed_policy=split cpus_allowed=0-17,36-53 group_reporting=1 time_based runtime=60 allow_file_create=0 filename=/dev/dax0.0 Example more aggressive script: [global] kb_base=1000 ioengine=mmap iodepth=1 rw=randread bs=64KiB ba=64KiB numjobs=36 cpus_allowed_policy=split cpus_allowed=0-17,36-53 group_reporting=1 norandommap randrepeat=0 zero_buffers gtod_reduce=1 time_based runtime=60999 allow_file_create=0 [d0] size=7GiB filename=/dev/dax0.0 [d1] size=7GiB filename=/dev/dax1.0 [d2] size=7GiB filename=/dev/dax2.0 [d3] size=7GiB filename=/dev/dax3.0 [d4] size=7GiB filename=/dev/dax4.0 [d5] size=7GiB filename=/dev/dax5.0 [d6] size=7GiB filename=/dev/dax6.0 [d7] size=7GiB filename=/dev/dax7.0 ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 3:35 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) 2017-11-29 4:44 ` Rebecca Cran ` (3 more replies) 0 siblings, 4 replies; 39+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2017-11-29 4:13 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Tuesday, November 28, 2017 9:35 PM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran > <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> > Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> > Subject: RE: fio 3.2 > 4. The CPU instructions used by fio depend on the glibc library version. > As mentioned in an earlier fio thread, that changes a lot. With > libc2.24.so, random reads seem to be done with rep movsb. > > The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't > understand what does it mean - " random reads seem to be done with rep > movsb " While the test is running, run perf top and select one of the busiest functions (hopefully a memcpy function): 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.85% [unknown] [k] 0x00007f7414f5c4ff 0.53% [kernel] [k] wait_consider_task 0.33% fio [.] get_io_u 0.29% fio [.] td_io_queue 0.26% fio [.] io_u_sync_complete Hit enter twice to get to the assembly language code. This shows you the amount of time the CPUs are spending in each instruction (based on sampling): 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx 0.01 │ ↓ jae 15d │ cmp %rsi,%rdi 0.00 │ ↓ jb 4c │ ↓ je 51 │ lea (%rsi,%rdx,1),%r9 │ cmp %r9,%rdi │ ↓ jb 211 │ 4c: mov %rdx,%rcx 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.08 │ 51: ← retq │ 52: cmp $0x10,%dl That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches). Some more examples of how to analyze the results: If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read: 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.28% [unknown] [k] 0x00007f34da4824ff 0.87% fio [.] axmap_isset 0.58% [kernel] [k] wait_consider_task 0.38% fio [.] get_io_u 0.34% fio [.] io_u_sync_complete If I switch to random writes and remove the zerobuffers option, fio spends about 7% of the CPU time creating non-zero write data (to then write into persistent memory): 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms 6.98% fio [.] get_io_u 4.24% [unknown] [k] 0x00007f8ee86d34ff 0.85% fio [.] io_queue_event 0.82% fio [.] axmap_isset 0.62% fio [.] io_u_sync_complete 0.62% [kernel] [k] wait_consider_task 0.58% fio [.] td_io_queue 0.45% fio [.] thread_main get_io_u calls small_content_scramble which calls memcpy for 8 bytes at a time. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions: │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) 5.18 │ mov 0x0(%rbp),%rsi │ small_content_scramble(): │ get_verify = 1; 2.49 │ add $0x200,%rax 2.81 │ mov 0x8(%rbp),%rdi │ memcpy(): 32.91 │ mov %rsi,-0x10(%rcx) 3.45 │ mov %rdi,-0x8(%rcx) │ small_content_scramble(): │ td->trim_batch = td->o.trim_backlog; │ get_trim = 1; │ } │ │ if (get_trim && get_next_trim(td, io_u)) │ return true; small_content_scramble has hardly been touched since 2011, so it probably hasn't had much performance analysis. One of the few changes made was to add an integer divide by 1000, which is always slow (painfully slow on some CPUs): offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; perf top doesn't show that taking time - I think the compiler realized it could pull that calculation out of the loop and just do it once. Different compilers and compiler options might not realize that. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) @ 2017-11-29 4:44 ` Rebecca Cran 2017-11-30 4:04 ` Jens Axboe 2017-11-29 6:55 ` Gavriliuk, Anton (HPS Ukraine) ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Rebecca Cran @ 2017-11-29 4:44 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory) Cc: Gavriliuk, Anton (HPS Ukraine), Sitsofe Wheeler, fio@vger.kernel.org, Kani, Toshimitsu I’ve not followed the whole discussion, but I did some tests a couple of months ago replacing the calls to memcpy in the memmap engine with a new fio_memcpy function, rebuilding fio with the -ftree-vectorize flag along with -msse2, -mavx etc. I saw an improvement of a few hundred MB/sec on my openSUSE system. — Rebecca Sent from my iPhone > On Nov 28, 2017, at 9:13 PM, Elliott, Robert (Persistent Memory) <elliott@hpe.com> wrote: > > >> -----Original Message----- >> From: Gavriliuk, Anton (HPS Ukraine) >> Sent: Tuesday, November 28, 2017 9:35 PM >> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran >> <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> >> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> >> Subject: RE: fio 3.2 > > >> 4. The CPU instructions used by fio depend on the glibc library version. >> As mentioned in an earlier fio thread, that changes a lot. With >> libc2.24.so, random reads seem to be done with rep movsb. >> >> The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't >> understand what does it mean - " random reads seem to be done with rep >> movsb " > > While the test is running, run > perf top > > and select one of the busiest functions (hopefully a memcpy function): > > 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms > 3.85% [unknown] [k] 0x00007f7414f5c4ff > 0.53% [kernel] [k] wait_consider_task > 0.33% fio [.] get_io_u > 0.29% fio [.] td_io_queue > 0.26% fio [.] io_u_sync_complete > > Hit enter twice to get to the assembly language code. This shows you > the amount of time the CPUs are spending in each instruction (based on > sampling): > > 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx > 0.01 │ ↓ jae 15d > │ cmp %rsi,%rdi > 0.00 │ ↓ jb 4c > │ ↓ je 51 > │ lea (%rsi,%rdx,1),%r9 > │ cmp %r9,%rdi > │ ↓ jb 211 > │ 4c: mov %rdx,%rcx > 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) > 0.08 │ 51: ↠retq > │ 52: cmp $0x10,%dl > > That means 99.88% of 90.14% of the time is spend on rep movsb, presumably > reading from persistent memory (and writing to regular memory, but to > a buffer that's entirely in the CPU caches). > > Some more examples of how to analyze the results: > > If I remove the norandommap option, fio spends about 1% of the CPU time > in a function maintaining the list of which LBAs it has read: > 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms > 3.28% [unknown] [k] 0x00007f34da4824ff > 0.87% fio [.] axmap_isset > 0.58% [kernel] [k] wait_consider_task > 0.38% fio [.] get_io_u > 0.34% fio [.] io_u_sync_complete > > > If I switch to random writes and remove the zerobuffers option, fio spends > about 7% of the CPU time creating non-zero write data (to then write > into persistent memory): > 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms > 6.98% fio [.] get_io_u > 4.24% [unknown] [k] 0x00007f8ee86d34ff > 0.85% fio [.] io_queue_event > 0.82% fio [.] axmap_isset > 0.62% fio [.] io_u_sync_complete > 0.62% [kernel] [k] wait_consider_task > 0.58% fio [.] td_io_queue > 0.45% fio [.] thread_main > > get_io_u calls small_content_scramble which calls memcpy for 8 > bytes at a time. On my system, gcc chose to inline memcpy > rather than call the glibc library version, and it ends up just > using regular mov instructions: > > │ memcpy(): > │ > │ __fortify_function void * > │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, > │ size_t __len)) > │ { > │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); > 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) > 5.18 │ mov 0x0(%rbp),%rsi > │ small_content_scramble(): > │ get_verify = 1; > 2.49 │ add $0x200,%rax > 2.81 │ mov 0x8(%rbp),%rdi > │ memcpy(): > 32.91 │ mov %rsi,-0x10(%rcx) > 3.45 │ mov %rdi,-0x8(%rcx) > │ small_content_scramble(): > │ td->trim_batch = td->o.trim_backlog; > │ get_trim = 1; > │ } > │ > │ if (get_trim && get_next_trim(td, io_u)) > │ return true; > > small_content_scramble has hardly been touched since 2011, so it probably > hasn't had much performance analysis. > > One of the few changes made was to add an integer divide by 1000, which > is always slow (painfully slow on some CPUs): > > offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; > > perf top doesn't show that taking time - I think the compiler realized > it could pull that calculation out of the loop and just do it once. Different > compilers and compiler options might not realize that. > > > \x04èº{.nÇ+‰·Ÿ®‰†+%ŠËlzwm…ébë§²æìr¸›yø¨Š{ayº\x1dʇڙë,j\a¢f£¢·hš‹àz¹\x1e®w¥¢¸\f¢·¦j:+v‰¨ŠwèjØm¶Ÿÿ¾\a«‘êçzZ+ƒùšŽŠÝ¢j"ú!¶i ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-29 4:44 ` Rebecca Cran @ 2017-11-30 4:04 ` Jens Axboe 0 siblings, 0 replies; 39+ messages in thread From: Jens Axboe @ 2017-11-30 4:04 UTC (permalink / raw) To: Rebecca Cran, Elliott, Robert (Persistent Memory) Cc: Gavriliuk, Anton (HPS Ukraine), Sitsofe Wheeler, fio@vger.kernel.org, Kani, Toshimitsu On 11/28/2017 09:44 PM, Rebecca Cran wrote: > I’ve not followed the whole discussion, but I did some tests a couple > of months ago replacing the calls to memcpy in the memmap engine with > a new fio_memcpy function, rebuilding fio with the -ftree-vectorize > flag along with -msse2, -mavx etc. I saw an improvement of a few > hundred MB/sec on my openSUSE system. Care to send a patch for that? I think it'd be nice for fio to pack it's own, especially when the performance varies so wildly and is platform (and even version) dependent. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) 2017-11-29 4:44 ` Rebecca Cran @ 2017-11-29 6:55 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-29 7:40 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-30 4:21 ` Jens Axboe 3 siblings, 0 replies; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 6:55 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Ok. I will drill down my pmem tests with perf tool. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions What's your system ? Which gcc version ? Anton -----Original Message----- From: Elliott, Robert (Persistent Memory) Sent: Wednesday, November 29, 2017 6:14 AM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Tuesday, November 28, 2017 9:35 PM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca > Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> > Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> > Subject: RE: fio 3.2 > 4. The CPU instructions used by fio depend on the glibc library version. > As mentioned in an earlier fio thread, that changes a lot. With > libc2.24.so, random reads seem to be done with rep movsb. > > The test box SLES12 SP3 has glibc2.22, so we have to update. But I > can't understand what does it mean - " random reads seem to be done > with rep movsb " While the test is running, run perf top and select one of the busiest functions (hopefully a memcpy function): 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.85% [unknown] [k] 0x00007f7414f5c4ff 0.53% [kernel] [k] wait_consider_task 0.33% fio [.] get_io_u 0.29% fio [.] td_io_queue 0.26% fio [.] io_u_sync_complete Hit enter twice to get to the assembly language code. This shows you the amount of time the CPUs are spending in each instruction (based on sampling): 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx 0.01 │ ↓ jae 15d │ cmp %rsi,%rdi 0.00 │ ↓ jb 4c │ ↓ je 51 │ lea (%rsi,%rdx,1),%r9 │ cmp %r9,%rdi │ ↓ jb 211 │ 4c: mov %rdx,%rcx 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.08 │ 51: ← retq │ 52: cmp $0x10,%dl That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches). Some more examples of how to analyze the results: If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read: 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.28% [unknown] [k] 0x00007f34da4824ff 0.87% fio [.] axmap_isset 0.58% [kernel] [k] wait_consider_task 0.38% fio [.] get_io_u 0.34% fio [.] io_u_sync_complete If I switch to random writes and remove the zerobuffers option, fio spends about 7% of the CPU time creating non-zero write data (to then write into persistent memory): 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms 6.98% fio [.] get_io_u 4.24% [unknown] [k] 0x00007f8ee86d34ff 0.85% fio [.] io_queue_event 0.82% fio [.] axmap_isset 0.62% fio [.] io_u_sync_complete 0.62% [kernel] [k] wait_consider_task 0.58% fio [.] td_io_queue 0.45% fio [.] thread_main get_io_u calls small_content_scramble which calls memcpy for 8 bytes at a time. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions: │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) 5.18 │ mov 0x0(%rbp),%rsi │ small_content_scramble(): │ get_verify = 1; 2.49 │ add $0x200,%rax 2.81 │ mov 0x8(%rbp),%rdi │ memcpy(): 32.91 │ mov %rsi,-0x10(%rcx) 3.45 │ mov %rdi,-0x8(%rcx) │ small_content_scramble(): │ td->trim_batch = td->o.trim_backlog; │ get_trim = 1; │ } │ │ if (get_trim && get_next_trim(td, io_u)) │ return true; small_content_scramble has hardly been touched since 2011, so it probably hasn't had much performance analysis. One of the few changes made was to add an integer divide by 1000, which is always slow (painfully slow on some CPUs): offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; perf top doesn't show that taking time - I think the compiler realized it could pull that calculation out of the loop and just do it once. Different compilers and compiler options might not realize that. ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) 2017-11-29 4:44 ` Rebecca Cran 2017-11-29 6:55 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 7:40 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-29 8:22 ` Elliott, Robert (Persistent Memory) 2017-11-30 4:21 ` Jens Axboe 3 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 7:40 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Previously I used fio 2.12 and it works fine. Now I build fio 3.2 with gcc 4.8.5 and it returns error, dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=100g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2 Starting 16 processes 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Run status group 0 (all jobs): dl560g10spmem01:~ # -----Original Message----- From: Gavriliuk, Anton (HPS Ukraine) Sent: Wednesday, November 29, 2017 8:56 AM To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 Ok. I will drill down my pmem tests with perf tool. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions What's your system ? Which gcc version ? Anton -----Original Message----- From: Elliott, Robert (Persistent Memory) Sent: Wednesday, November 29, 2017 6:14 AM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Tuesday, November 28, 2017 9:35 PM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca > Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> > Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> > Subject: RE: fio 3.2 > 4. The CPU instructions used by fio depend on the glibc library version. > As mentioned in an earlier fio thread, that changes a lot. With > libc2.24.so, random reads seem to be done with rep movsb. > > The test box SLES12 SP3 has glibc2.22, so we have to update. But I > can't understand what does it mean - " random reads seem to be done > with rep movsb " While the test is running, run perf top and select one of the busiest functions (hopefully a memcpy function): 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.85% [unknown] [k] 0x00007f7414f5c4ff 0.53% [kernel] [k] wait_consider_task 0.33% fio [.] get_io_u 0.29% fio [.] td_io_queue 0.26% fio [.] io_u_sync_complete Hit enter twice to get to the assembly language code. This shows you the amount of time the CPUs are spending in each instruction (based on sampling): 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx 0.01 │ ↓ jae 15d │ cmp %rsi,%rdi 0.00 │ ↓ jb 4c │ ↓ je 51 │ lea (%rsi,%rdx,1),%r9 │ cmp %r9,%rdi │ ↓ jb 211 │ 4c: mov %rdx,%rcx 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.08 │ 51: ← retq │ 52: cmp $0x10,%dl That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches). Some more examples of how to analyze the results: If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read: 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.28% [unknown] [k] 0x00007f34da4824ff 0.87% fio [.] axmap_isset 0.58% [kernel] [k] wait_consider_task 0.38% fio [.] get_io_u 0.34% fio [.] io_u_sync_complete If I switch to random writes and remove the zerobuffers option, fio spends about 7% of the CPU time creating non-zero write data (to then write into persistent memory): 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms 6.98% fio [.] get_io_u 4.24% [unknown] [k] 0x00007f8ee86d34ff 0.85% fio [.] io_queue_event 0.82% fio [.] axmap_isset 0.62% fio [.] io_u_sync_complete 0.62% [kernel] [k] wait_consider_task 0.58% fio [.] td_io_queue 0.45% fio [.] thread_main get_io_u calls small_content_scramble which calls memcpy for 8 bytes at a time. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions: │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) 5.18 │ mov 0x0(%rbp),%rsi │ small_content_scramble(): │ get_verify = 1; 2.49 │ add $0x200,%rax 2.81 │ mov 0x8(%rbp),%rdi │ memcpy(): 32.91 │ mov %rsi,-0x10(%rcx) 3.45 │ mov %rdi,-0x8(%rcx) │ small_content_scramble(): │ td->trim_batch = td->o.trim_backlog; │ get_trim = 1; │ } │ │ if (get_trim && get_next_trim(td, io_u)) │ return true; small_content_scramble has hardly been touched since 2011, so it probably hasn't had much performance analysis. One of the few changes made was to add an integer divide by 1000, which is always slow (painfully slow on some CPUs): offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; perf top doesn't show that taking time - I think the compiler realized it could pull that calculation out of the loop and just do it once. Different compilers and compiler options might not realize that. ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 7:40 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 8:22 ` Elliott, Robert (Persistent Memory) 2017-11-29 13:12 ` Gavriliuk, Anton (HPS Ukraine) 0 siblings, 1 reply; 39+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2017-11-29 8:22 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler, Tomohiro Kusumi Cc: fio@vger.kernel.org, Kani, Toshimitsu > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Wednesday, November 29, 2017 1:40 AM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca > Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> > Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> > Subject: RE: fio 3.2 > > Previously I used fio 2.12 and it works fine. Now I build fio 3.2 > with gcc 4.8.5 and it returns error, > > dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx -- > ioengine=mmap --iodepth=1 --rw=randrw -- > bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- > size=100g --numjobs=16 --group_reporting --runtime=120 -- > filename=/dev/dax0.0 --allow_file_create=0 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B- > 256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 > ... > fio-3.2 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 I see the same thing. Commit 0f34169a broke it, moving a check from a non-character device path up a level to where it now triggers on a character device (for which the size is left set to -1ULL). This gets it working again: git revert 0f34169a make Tomohiro, could you help prepare a better fix? ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 8:22 ` Elliott, Robert (Persistent Memory) @ 2017-11-29 13:12 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-29 17:17 ` Sitsofe Wheeler 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 13:12 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler, Tomohiro Kusumi Cc: fio@vger.kernel.org, Kani, Toshimitsu Latest findings... With the fio2.12 /dev/dax & mmap(). I run trace before start test. We can see above open(), mmap() 5.630816 cpu=65 pid=45154 tgid=45154 open [2] entry *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 filename: /dev/dax0.0 5.630819 cpu=65 pid=45154 tgid=45154 open [2] ret=3 syscallbeg= 0.000003 *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 5.630833 cpu=65 pid=45154 tgid=45154 mmap [9] entry *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0 5.630836 cpu=65 pid=45154 tgid=45154 mmap [9] ret=0x7f1a00200000 syscallbeg= 0.000003 *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0 5.630836 cpu=65 pid=45154 tgid=45154 madvise [28] entry *addr=0x7f1a00200000 length=-2147483648 advise=0x1 5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0 syscallbeg= 0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1 How it could be that mmap() length size -2Gb ??? (length=-2147483648). I pointed +290g, dl560g10spmem01:/var/work/fio-3.2 # /usr/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g -----Original Message----- From: Elliott, Robert (Persistent Memory) Sent: Wednesday, November 29, 2017 10:23 AM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>; Tomohiro Kusumi <tkusumi@tuxera.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Wednesday, November 29, 2017 1:40 AM > To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca > Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> > Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> > Subject: RE: fio 3.2 > > Previously I used fio 2.12 and it works fine. Now I build fio 3.2 > with gcc 4.8.5 and it returns error, > > dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx -- > ioengine=mmap --iodepth=1 --rw=randrw -- > bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- > size=100g --numjobs=16 --group_reporting --runtime=120 -- > filename=/dev/dax0.0 --allow_file_create=0 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B- > 256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... > fio-3.2 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 I see the same thing. Commit 0f34169a broke it, moving a check from a non-character device path up a level to where it now triggers on a character device (for which the size is left set to -1ULL). This gets it working again: git revert 0f34169a make Tomohiro, could you help prepare a better fix? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-29 13:12 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 17:17 ` Sitsofe Wheeler 2017-11-29 20:24 ` Elliott, Robert (Persistent Memory) 0 siblings, 1 reply; 39+ messages in thread From: Sitsofe Wheeler @ 2017-11-29 17:17 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine) Cc: Elliott, Robert (Persistent Memory), Rebecca Cran, Tomohiro Kusumi, fio@vger.kernel.org, Kani, Toshimitsu Unsigned number being printed as signed? On 29 November 2017 at 13:12, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > Latest findings... > > With the fio2.12 /dev/dax & mmap(). I run trace before start test. We can see above open(), mmap() > > 5.630816 cpu=65 pid=45154 tgid=45154 open [2] entry *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 filename: /dev/dax0.0 > 5.630819 cpu=65 pid=45154 tgid=45154 open [2] ret=3 syscallbeg= 0.000003 *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 > 5.630833 cpu=65 pid=45154 tgid=45154 mmap [9] entry *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0 > 5.630836 cpu=65 pid=45154 tgid=45154 mmap [9] ret=0x7f1a00200000 syscallbeg= 0.000003 *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0 > 5.630836 cpu=65 pid=45154 tgid=45154 madvise [28] entry *addr=0x7f1a00200000 length=-2147483648 advise=0x1 > 5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0 syscallbeg= 0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1 > > How it could be that mmap() length size -2Gb ??? (length=-2147483648). I pointed +290g, > > dl560g10spmem01:/var/work/fio-3.2 # /usr/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-29 17:17 ` Sitsofe Wheeler @ 2017-11-29 20:24 ` Elliott, Robert (Persistent Memory) 0 siblings, 0 replies; 39+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2017-11-29 20:24 UTC (permalink / raw) To: Sitsofe Wheeler, Gavriliuk, Anton (HPS Ukraine) Cc: Rebecca Cran, Tomohiro Kusumi, fio@vger.kernel.org, Kani, Toshimitsu > From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On > Behalf Of Sitsofe Wheeler ... > On 29 November 2017 at 13:12, Gavriliuk, Anton (HPS Ukraine) > <anton.gavriliuk@hpe.com> wrote: > > Latest findings... > > > > With the fio2.12 /dev/dax & mmap(). I run trace before start test. We > can see above open(), mmap() > > ... > > 5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0 > syscallbeg= 0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1 > > > > How it could be that mmap() length size -2Gb ??? (length=-2147483648). > > I pointed +290g, > > Unsigned number being printed as signed? Yes, that's 0x80000000 in hex. The size_t argument is indeed unsigned (ssize_t is signed): int madvise(void *addr, size_t length, int advice); This print is apparently from: https://github.com/HewlettPackard/LinuxKI in src/kiinfo/syscalls.c print_syscall_args(): switch (format) { case SKIP: break; case DECIMAL: printf ("%c%s=%i", fsep, label, (int32)argval); break; case HEX: printf ("%c%s=0x%llx", fsep, label, argval); break; case OCTAL: printf ("%c%s=0%llo", fsep, label, argval); break; printed as a signed 32-bit value because of src/kiinfo/globals.c: {"madvise", "ret", HEX, "*addr", HEX, "length", DECIMAL, "advise", HEX, NULL, SKIP, NULL, SKIP, NULL, SKIP}, {"mmap", "ret", HEX, "*addr", HEX, "length", DECIMAL, "prot", MMAP_PROT, "flags", MMAP_FLAGS, "fd", DECIMAL, "offset", HEX}, That project should differentiate between signed and unsigned arguments, and add support for 64-bit signed and unsigned arguments as well. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-29 4:13 ` Elliott, Robert (Persistent Memory) ` (2 preceding siblings ...) 2017-11-29 7:40 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-30 4:21 ` Jens Axboe 2017-11-30 14:17 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-30 16:13 ` Jens Axboe 3 siblings, 2 replies; 39+ messages in thread From: Jens Axboe @ 2017-11-30 4:21 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote: > small_content_scramble has hardly been touched since 2011, so it probably > hasn't had much performance analysis. That's fair, would be a good thing to look at, especially since it's on by default. > One of the few changes made was to add an integer divide by 1000, which > is always slow (painfully slow on some CPUs): > > offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; > > perf top doesn't show that taking time - I think the compiler realized > it could pull that calculation out of the loop and just do it once. Different > compilers and compiler options might not realize that. Yeah, int divides suck. This didn't used to be like this, it got added with the nsec conversion. I have fixed this up now. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-30 4:21 ` Jens Axboe @ 2017-11-30 14:17 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-30 21:23 ` Jens Axboe 2017-11-30 16:13 ` Jens Axboe 1 sibling, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-30 14:17 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu It's there any chance to fix it ? dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-2.99 Starting 16 processes 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Thursday, November 30, 2017 6:22 AM To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: Re: fio 3.2 On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote: > small_content_scramble has hardly been touched since 2011, so it > probably hasn't had much performance analysis. That's fair, would be a good thing to look at, especially since it's on by default. > One of the few changes made was to add an integer divide by 1000, > which is always slow (painfully slow on some CPUs): > > offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; > > perf top doesn't show that taking time - I think the compiler realized > it could pull that calculation out of the loop and just do it once. > Different compilers and compiler options might not realize that. Yeah, int divides suck. This didn't used to be like this, it got added with the nsec conversion. I have fixed this up now. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-30 14:17 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-30 21:23 ` Jens Axboe 2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine) ` (3 more replies) 0 siblings, 4 replies; 39+ messages in thread From: Jens Axboe @ 2017-11-30 21:23 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Elliott, Robert (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > It's there any chance to fix it ? > > dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 > ... > fio-2.99 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Should already be fixed as of yesterday morning. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-30 21:23 ` Jens Axboe @ 2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-01 7:15 ` Robert Elliott (Persistent Memory) 2017-12-03 3:24 ` Gavriliuk, Anton (HPS Ukraine) ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-01 5:21 UTC (permalink / raw) To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Ok, thanks Jens and Elliott I fixed that. Now with /dev/dax0.0 and mmap ioengine slat is gone. But if fio maps entire /dev/dax0.0 device once and the using CPU load/store, why clat number's so high, 6us for read, 12us for write ?? dl560g10spmem01:/var/work/fio # taskset -c 0-19 /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-53-ga7d0-dirty Starting 16 processes Jobs: 1 (f=1): [_(5),m(1),_(10)][100.0%][r=1142MiB/s,w=21.2GiB/s][r=22.4k,w=428k IOPS][eta 00m:00s] 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=67241: Thu Nov 30 23:04:51 2017 read: IOPS=37.0k, BW=1931MiB/s (2025MB/s)(232GiB/123040msec) clat (nsec): min=197, max=753983, avg=6709.21, stdev=4157.00 lat (nsec): min=218, max=754005, avg=6740.89, stdev=4157.15 clat percentiles (nsec): | 1.00th=[ 572], 5.00th=[ 996], 10.00th=[ 1528], 20.00th=[ 3568], | 30.00th=[ 4832], 40.00th=[ 6880], 50.00th=[ 7328], 60.00th=[ 7712], | 70.00th=[ 8096], 80.00th=[ 8512], 90.00th=[ 9536], 95.00th=[10688], | 99.00th=[25984], 99.50th=[31360], 99.90th=[40192], 99.95th=[45824], | 99.99th=[54016] bw ( KiB/s): min=110400, max=148800, per=6.28%, avg=124171.80, stdev=4366.58, samples=3910 iops : min= 2114, max= 2842, avg=2384.24, stdev=74.15, samples=3910 write: IOPS=721k, BW=35.8GiB/s (38.5GB/s)(4408GiB/123040msec) clat (nsec): min=72, max=814061, avg=12174.57, stdev=7463.58 lat (nsec): min=91, max=814132, avg=12221.87, stdev=7475.72 clat percentiles (nsec): | 1.00th=[ 684], 5.00th=[ 1400], 10.00th=[ 2064], 20.00th=[ 6304], | 30.00th=[ 8384], 40.00th=[12992], 50.00th=[13760], 60.00th=[14400], | 70.00th=[15040], 80.00th=[15936], 90.00th=[17024], 95.00th=[18048], | 99.00th=[46336], 99.50th=[59648], 99.90th=[63232], 99.95th=[64256], | 99.99th=[71168] bw ( MiB/s): min= 2197, max= 2647, per=6.28%, avg=2303.69, stdev=20.83, samples=3910 iops : min=43091, max=52348, avg=45290.89, stdev=447.39, samples=3910 lat (nsec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=1.99%, 1000=1.53% lat (usec) : 2=6.41%, 4=6.96%, 10=18.78%, 20=61.82%, 50=1.56% lat (usec) : 100=0.95%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% cpu : usr=99.80%, sys=0.18%, ctx=4735, majf=0, minf=2519628 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=4671537,88741744,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: bw=1931MiB/s (2025MB/s), 1931MiB/s-1931MiB/s (2025MB/s-2025MB/s), io=232GiB (249GB), run=123040-123040msec WRITE: bw=35.8GiB/s (38.5GB/s), 35.8GiB/s-35.8GiB/s (38.5GB/s-38.5GB/s), io=4408GiB (4733GB), run=123040-123040msec Fio numa placement doesn't work, when I remove taskset and add numa_cpu_nodes dl560g10spmem01:/var/work/fio # dl560g10spmem01:/var/work/fio # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300g --numa_cpu_nodes=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-53-ga7d0-dirty Starting 16 processes fio: pid=67474, got signal=7 fio: pid=67481, got signal=7 fio: pid=67479, got signal=7 fio: pid=67482, got signal=7 fio: pid=67483, got signal=7 fio: pid=67480, got signal=7 fio: pid=67486, got signal=7 fio: pid=67485, got signal=7 fio: pid=67475, got signal=74),K(5),m(1),K(2),m(1)][0.4%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 29m:52s] fio: pid=67478, got signal=7 fio: pid=67472, got signal=7 fio: pid=67484, got signal=7 fio: pid=67476, got signal=72),m(2),K(9),m(1)][0.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 30m:22s] fio: pid=67487, got signal=7 fio: pid=67473, got signal=7 fio: pid=67477, got signal=7 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=67472: Thu Nov 30 23:13:07 2017 lat (usec) : 4=12.80%, 10=22.30%, 20=62.03%, 50=2.87% cpu : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwt: total=24,429,0, short=0,0,0, dropped=0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): fio: file hash not empty on exit dl560g10spmem01:/var/work/fio # -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Thursday, November 30, 2017 11:24 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: Re: fio 3.2 On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > It's there any chance to fix it ? > > dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 > --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --size=290g > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... > fio-2.99 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Should already be fixed as of yesterday morning. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-01 7:15 ` Robert Elliott (Persistent Memory) 2017-12-03 9:35 ` Sitsofe Wheeler 0 siblings, 1 reply; 39+ messages in thread From: Robert Elliott (Persistent Memory) @ 2017-12-01 7:15 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu > Fio numa placement doesn't work, when I remove taskset and add > numa_cpu_nodes > > dl560g10spmem01:/var/work/fio # taskset -c 0-19 /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g > dl560g10spmem01:/var/work/fio # /usr/local/bin/fio -- > filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap -- > randrepeat=0 --ioengine=mmap -- > bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- > iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4- > rand-rw-3xx --size=300g --numa_cpu_nodes=0 > Starting 16 processes > fio: pid=67474, got signal=7 signal=7 occurs if the size is too big; you increased the size by 10 GiB and probably hit the limit (not accounting for the ndctl metadata overhead). It's probably a 300 GiB capacity While discussing NUMA, I'll mention something else I saw in Windows while fixing the thread affinities there. At startup, fio spawns threads on all CPUs to measure the clocks (fio_monotonic_clocktest). If you've constrained the CPU affinity outside fio, some of those will fail. In Windows, something like START /AFFINITY 0x55555555 fio ... can cause half of the clock threads to fail. My Windows machine stopped working around then, so I haven't gotten around to trying a fix for that yet. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-12-01 7:15 ` Robert Elliott (Persistent Memory) @ 2017-12-03 9:35 ` Sitsofe Wheeler 2017-12-03 17:10 ` Jens Axboe 0 siblings, 1 reply; 39+ messages in thread From: Sitsofe Wheeler @ 2017-12-03 9:35 UTC (permalink / raw) To: Robert Elliott (Persistent Memory) Cc: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran, fio@vger.kernel.org, Kani, Toshimitsu On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory) <elliott@hpe.com> wrote: > While discussing NUMA, I'll mention something else I saw in Windows > while fixing the thread affinities there. > > At startup, fio spawns threads on all CPUs to measure the clocks > (fio_monotonic_clocktest). If you've constrained the CPU affinity > outside fio, some of those will fail. In Windows, something like > START /AFFINITY 0x55555555 fio ... > can cause half of the clock threads to fail. This is very weird and doesn't make any sense (but I believe you): if you have multiple threads crammed on to the same CPUs the TSC no longer looks like it monotonically increases? Surely it should be MORE likely to increase because a thread is likely to be on the same CPU as another and can't actually be running at the same time as the other? Something suspicious is going on - to this day I'm still unsure as to why https://github.com/sitsofe/fio/commit/1a7a1fd98ad749768212f49f5ba3e9a1392c4263 would sometimes fail on Windows (but seemingly not on Linux)... -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-12-03 9:35 ` Sitsofe Wheeler @ 2017-12-03 17:10 ` Jens Axboe 2017-12-05 20:29 ` Elliott, Robert (Persistent Memory) 0 siblings, 1 reply; 39+ messages in thread From: Jens Axboe @ 2017-12-03 17:10 UTC (permalink / raw) To: Sitsofe Wheeler, Robert Elliott (Persistent Memory) Cc: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, fio@vger.kernel.org, Kani, Toshimitsu On 12/03/2017 02:35 AM, Sitsofe Wheeler wrote: > On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory) > <elliott@hpe.com> wrote: >> While discussing NUMA, I'll mention something else I saw in Windows >> while fixing the thread affinities there. >> >> At startup, fio spawns threads on all CPUs to measure the clocks >> (fio_monotonic_clocktest). If you've constrained the CPU affinity >> outside fio, some of those will fail. In Windows, something like >> START /AFFINITY 0x55555555 fio ... >> can cause half of the clock threads to fail. > > This is very weird and doesn't make any sense (but I believe you): if > you have multiple threads crammed on to the same CPUs the TSC no > longer looks like it monotonically increases? Surely it should be MORE > likely to increase because a thread is likely to be on the same CPU as > another and can't actually be running at the same time as the other? The threads fail to start, it's not a TSC failure. I'm guessing it's because fio gets limited to a subset of the available CPUs, and that causes fio to fail doing the clock check when fio_setaffinity() in clock_thread_fn() fails. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-03 17:10 ` Jens Axboe @ 2017-12-05 20:29 ` Elliott, Robert (Persistent Memory) 0 siblings, 0 replies; 39+ messages in thread From: Elliott, Robert (Persistent Memory) @ 2017-12-05 20:29 UTC (permalink / raw) To: Jens Axboe, Sitsofe Wheeler Cc: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, fio@vger.kernel.org, Kani > -----Original Message----- > From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On > Behalf Of Jens Axboe > Sent: Sunday, December 3, 2017 11:11 AM > To: Sitsofe Wheeler <sitsofe@gmail.com>; Robert Elliott (Persistent > Memory) <elliott@hpe.com> > Cc: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran > <rebecca@bluestop.org>; fio@vger.kernel.org; Kani, Toshimitsu > <toshi.kani@hpe.com> > Subject: Re: fio 3.2 > > On 12/03/2017 02:35 AM, Sitsofe Wheeler wrote: > > On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory) > > <elliott@hpe.com> wrote: > >> While discussing NUMA, I'll mention something else I saw in Windows > >> while fixing the thread affinities there. > >> > >> At startup, fio spawns threads on all CPUs to measure the clocks > >> (fio_monotonic_clocktest). If you've constrained the CPU affinity > >> outside fio, some of those will fail. In Windows, something like > >> START /AFFINITY 0x55555555 fio ... > >> can cause half of the clock threads to fail. > > > > This is very weird and doesn't make any sense (but I believe you): if > > you have multiple threads crammed on to the same CPUs the TSC no > > longer looks like it monotonically increases? Surely it should be MORE > > likely to increase because a thread is likely to be on the same CPU as > > another and can't actually be running at the same time as the other? > > The threads fail to start, it's not a TSC failure. I'm guessing it's > because fio gets limited to a subset of the available CPUs, and that > causes fio to fail doing the clock check when fio_setaffinity() in > clock_thread_fn() fails. I'll work on a patch to limit all threads (including clocktest threads at startup and I/O threads) to the parent's affinity mask in Windows. In linux, processes and threads are not restricted to their parent's affinity mask. Although they inherit by default, they may call sched_setaffinity with any values. The taskset and numactl manpages don't mention that programs may override the settings requested by the user. Example: $ numactl -a -C 5 fio --debug=process -dax_mmap.fio including: cpus_allowed_policy=split cpus_allowed=0-17,36-53 thread fio is allowed to place threads on all the CPUs: ROB: process 20358 cpu=5 thread affinity_count=1 ROB: process 20353 cpu=0 thread affinity_count=1 ROB: process 20355 cpu=2 thread affinity_count=1 ROB: process 20354 cpu=1 thread affinity_count=1 ROB: process 20356 cpu=3 thread affinity_count=1 ... Should fio apply the same filtering in linux? If not, should it print an info or warning message when creating threads that violate the parent mask? --- Robert Elliott, HPE Persistent Memory ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-30 21:23 ` Jens Axboe 2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-03 3:24 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 17:30 ` Gavriliuk, Anton (HPS Ukraine) 3 siblings, 0 replies; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-03 3:24 UTC (permalink / raw) To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Ok, thanks. Now I built latest fio available :-) dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-64-gaf7fb Starting 16 processes Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s] From perf top, Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613 Overhead Shared Object Symbol ◆ 93.34% libc-2.22.so [.] __memcpy_avx_unaligned Just can't drill it down, __memcpy_avx_unaligned /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org While other functions can, get_io_u /usr/local/bin/fio 0.08 │ xor %eax,%esi 0.01 │ shl $0x6,%rsi 0.02 │ and $0x1c0,%esi 0.00 │ sub %rax,%rsi 0.12 │ add %rdx,%rsi 0.00 │ sub %rax,%rdx │ nop │ /* │ * Fill offset into start of cacheline, time into end │ * of cacheline │ */ │ iptr = (void *) p + (offset << 6); │ *iptr = boffset; 16.01 │ 230: mov %rax,(%rsi,%rax,1) │ │ iptr = (void *) p + 64 - 2 * sizeof(uint64_t); │ iptr[0] = io_u->start_time.tv_sec; 7.03 │ mov 0x0(%rbp),%rcx 13.52 │ mov %rcx,0x30(%rdx,%rax,1) -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Thursday, November 30, 2017 11:24 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: Re: fio 3.2 On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > It's there any chance to fix it ? > > dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 > --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --size=290g > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... > fio-2.99 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Should already be fixed as of yesterday morning. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-30 21:23 ` Jens Axboe 2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-03 3:24 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 17:30 ` Robert Elliott (Persistent Memory) 2017-12-04 18:41 ` Jens Axboe 2017-12-04 17:30 ` Gavriliuk, Anton (HPS Ukraine) 3 siblings, 2 replies; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:12 UTC (permalink / raw) To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu I have run next command several times, perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 each perf report shows me toppest clock_thread_fn(), Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235 Overhead Command Shared Object Symbol 75.27% fio fio [.] clock_thread_fn 13.68% fio libc-2.22.so [.] __memcpy_avx_unaligned 10.05% fio fio [.] fill_random_buf Samples: 8M of event 'cycles:ppp', Event count (approx.): 28889015047191 Overhead Command Shared Object Symbol 79.97% fio fio [.] clock_thread_fn 11.08% fio libc-2.22.so [.] __memcpy_avx_unaligned 8.14% fio fio [.] fill_random_buf 0.13% fio fio [.] get_io_u Samples: 8M of event 'cycles:ppp', Event count (approx.): 57473976919058 Overhead Command Shared Object Symbol 89.94% fio fio [.] clock_thread_fn 5.57% fio libc-2.22.so [.] __memcpy_avx_unaligned 4.09% fio fio [.] fill_random_buf 0.07% fio fio [.] get_io_u Samples: 8M of event 'cycles:ppp', Event count (approx.): 160828042961056 Overhead Command Shared Object Symbol ◆ 92.30% fio fio [.] clock_thread_fn ▒ 4.10% fio [kernel.kallsyms] [k] filemap_map_pages ▒ 1.99% fio libc-2.22.so [.] __memcpy_avx_unaligned ▒ 1.46% fio fio [.] fill_random_buf ▒ 0.02% fio fio [.] get_io_u ▒ 0.02% fio fio [.] io_u_sync_complete Samples: 8M of event 'cycles:ppp', Event count (approx.): 163028870650810 Overhead Command Shared Object Symbol 87.68% fio fio [.] clock_thread_fn 8.77% fio [kernel.kallsyms] [k] run_timer_softirq 1.96% fio libc-2.22.so [.] __memcpy_avx_unaligned 1.44% fio fio [.] fill_random_buf 0.02% fio fio [.] get_io_u 0.02% fio fio [.] io_u_sync_complete 89.62 │ lock cmpxchg %edi,(%rsi) ▒ │ clock_thread_fn(): ▒ │ seq = *t->seq; ▒ │ if (seq == UINT_MAX) ▒ │ break; ▒ │ __sync_synchronize(); ▒ │ tsc = get_cpu_clock(); ▒ │ } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1)); ▒ │ cmp %eax,%ecx ▒ │ ↓ je 118 ◆ │ uint32_t seq; ▒ │ uint64_t tsc; ▒ │ ▒ │ c->cpu = t->cpu; ▒ │ do { ▒ │ seq = *t->seq; ▒ │ c9: mov 0x68(%rbx),%rcx ▒ 10.01 │ mov (%rcx),%ecx May you please clarify, is it ok ?, what is the lock contention... Anton ▒ -----Original Message----- From: Gavriliuk, Anton (HPS Ukraine) Sent: Sunday, December 03, 2017 5:24 AM To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 Ok, thanks. Now I built latest fio available :-) dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-64-gaf7fb Starting 16 processes Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s] From perf top, Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613 Overhead Shared Object Symbol ◆ 93.34% libc-2.22.so [.] __memcpy_avx_unaligned Just can't drill it down, __memcpy_avx_unaligned /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org While other functions can, get_io_u /usr/local/bin/fio 0.08 │ xor %eax,%esi 0.01 │ shl $0x6,%rsi 0.02 │ and $0x1c0,%esi 0.00 │ sub %rax,%rsi 0.12 │ add %rdx,%rsi 0.00 │ sub %rax,%rdx │ nop │ /* │ * Fill offset into start of cacheline, time into end │ * of cacheline │ */ │ iptr = (void *) p + (offset << 6); │ *iptr = boffset; 16.01 │ 230: mov %rax,(%rsi,%rax,1) │ │ iptr = (void *) p + 64 - 2 * sizeof(uint64_t); │ iptr[0] = io_u->start_time.tv_sec; 7.03 │ mov 0x0(%rbp),%rcx 13.52 │ mov %rcx,0x30(%rdx,%rax,1) -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Thursday, November 30, 2017 11:24 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: Re: fio 3.2 On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > It's there any chance to fix it ? > > dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 > --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --size=290g > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... > fio-2.99 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Should already be fixed as of yesterday morning. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:30 ` Robert Elliott (Persistent Memory) 2017-12-04 18:25 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 18:41 ` Jens Axboe 1 sibling, 1 reply; 39+ messages in thread From: Robert Elliott (Persistent Memory) @ 2017-12-04 17:30 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Monday, December 4, 2017 11:12 AM ... > I have run next command several times, > > perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw -- > refill_buffers --norandommap --randrepeat=0 --ioengine=mmap -- > bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- > iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw- > 3xx --size=290g --numa_cpu_nodes=0 > > each perf report shows me toppest clock_thread_fn(), > > Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235 > Overhead Command Shared Object Symbol > 75.27% fio fio [.] clock_thread_fn > 13.68% fio libc-2.22.so [.] __memcpy_avx_unaligned > 10.05% fio fio [.] fill_random_buf This just runs one pass through memory, so the startup overhead is magnified. Add --time_based=1 so it really runs for 1800 s. ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-04 17:30 ` Robert Elliott (Persistent Memory) @ 2017-12-04 18:25 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 18:50 ` Jens Axboe 0 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 18:25 UTC (permalink / raw) To: Robert Elliott (Persistent Memory), Jens Axboe, Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Ok, now test for 10 minutes with the same I/O latencies . So there is no startup overhead. dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-64-gaf7fb Starting 16 processes read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(1093GiB/600002msec) clat (nsec): min=120, max=16029k, avg=6897.44, stdev=5503.68 lat (nsec): min=144, max=16029k, avg=6932.35, stdev=5504.08 write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(20.3TiB/600002msec) clat (nsec): min=111, max=72027k, avg=12546.82, stdev=9845.85 lat (nsec): min=132, max=72027k, avg=12592.74, stdev=9851.98 Samples: 38M of event 'cycles:ppp', Event count (approx.): 182321387032131 Overhead Command Shared Object Symbol 77.20% fio fio [.] clock_thread_fn 8.32% fio libc-2.22.so [.] __memcpy_avx_unaligned 6.07% fio fio [.] fill_random_buf 4.82% fio [kernel.kallsyms] [k] page_remove_rmap 3.02% fio [kernel.kallsyms] [k] native_write_msr_safe 0.10% fio fio [.] get_io_u 88.99 │ lock cmpxchg %edi,(%rsi) │ clock_thread_fn(): │ seq = *t->seq; │ if (seq == UINT_MAX) │ break; │ __sync_synchronize(); │ tsc = get_cpu_clock(); │ } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1)); │ cmp %eax,%ecx │ ↓ je 118 │ uint32_t seq; │ uint64_t tsc; │ │ c->cpu = t->cpu; │ do { │ seq = *t->seq; 0.01 │ c9: mov 0x68(%rbx),%rcx 10.57 │ mov (%rcx),%ecx │ if (seq == UINT_MAX) │ cmp $0xffffffff,%ecx The same for sequential write it's much faster, write latency dropped 12us -> 5us, results in bw. But still blocked on clock_thread_fn(), dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=write --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1 4-rand-rw-3xx: (g=0): rw=write, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-64-gaf7fb Starting 16 processes write: IOPS=949k, BW=47.2GiB/s (50.6GB/s)(27.6TiB/600002msec) clat (nsec): min=214, max=36102k, avg=7406.20, stdev=5435.29 lat (nsec): min=233, max=36102k, avg=7442.12, stdev=5439.69 Samples: 38M of event 'cycles:ppp', Event count (approx.): 240873167676034 Overhead Command Shared Object Symbol 85.36% fio fio [.] clock_thread_fn 6.26% fio fio [.] fill_random_buf 5.07% fio libc-2.22.so [.] __memcpy_avx_unaligned 2.74% fio [kernel.kallsyms] [k] radix_tree_next_chunk 91.46 │ lock cmpxchg %edi,(%rsi) │ clock_thread_fn(): │ seq = *t->seq; │ if (seq == UINT_MAX) │ break; │ __sync_synchronize(); │ tsc = get_cpu_clock(); │ } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1)); │ cmp %eax,%ecx │ ↓ je 118 │ uint32_t seq; │ uint64_t tsc; │ │ c->cpu = t->cpu; │ do { │ seq = *t->seq; 0.01 │ c9: mov 0x68(%rbx),%rcx 8.10 │ mov (%rcx),%ecx Run again, the same results. Is it possible to break lat (nsec) down ? Anton -----Original Message----- From: Robert Elliott (Persistent Memory) Sent: Monday, December 04, 2017 7:31 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Jens Axboe <axboe@kernel.dk>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 > -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Monday, December 4, 2017 11:12 AM ... > I have run next command several times, > > perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw -- > refill_buffers --norandommap --randrepeat=0 --ioengine=mmap -- > bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- > iodepth=1 --numjobs=16 --runtime=1800 --group_reporting > --name=4-rand-rw- 3xx --size=290g --numa_cpu_nodes=0 > > each perf report shows me toppest clock_thread_fn(), > > Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235 > Overhead Command Shared Object Symbol > 75.27% fio fio [.] clock_thread_fn > 13.68% fio libc-2.22.so [.] __memcpy_avx_unaligned > 10.05% fio fio [.] fill_random_buf This just runs one pass through memory, so the startup overhead is magnified. Add --time_based=1 so it really runs for 1800 s. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-12-04 18:25 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 18:50 ` Jens Axboe 2017-12-04 19:17 ` Jeff Furlong 0 siblings, 1 reply; 39+ messages in thread From: Jens Axboe @ 2017-12-04 18:50 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 12/04/2017 11:25 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > Ok, now test for 10 minutes with the same I/O latencies . So there is no startup overhead. > > dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 > ... > fio-3.2-64-gaf7fb > Starting 16 processes > > read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(1093GiB/600002msec) > clat (nsec): min=120, max=16029k, avg=6897.44, stdev=5503.68 > lat (nsec): min=144, max=16029k, avg=6932.35, stdev=5504.08 > write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(20.3TiB/600002msec) > clat (nsec): min=111, max=72027k, avg=12546.82, stdev=9845.85 > lat (nsec): min=132, max=72027k, avg=12592.74, stdev=9851.98 > > Samples: 38M of event 'cycles:ppp', Event count (approx.): 182321387032131 > Overhead Command Shared Object Symbol > 77.20% fio fio [.] clock_thread_fn > 8.32% fio libc-2.22.so [.] __memcpy_avx_unaligned > 6.07% fio fio [.] fill_random_buf > 4.82% fio [kernel.kallsyms] [k] page_remove_rmap > 3.02% fio [kernel.kallsyms] [k] native_write_msr_safe > 0.10% fio fio [.] get_io_u That seems so odd, it makes no sense. --no-children added to perf report? I think this is skewed because the code is so hot for startup. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-04 18:50 ` Jens Axboe @ 2017-12-04 19:17 ` Jeff Furlong 2017-12-04 19:29 ` Gavriliuk, Anton (HPS Ukraine) 0 siblings, 1 reply; 39+ messages in thread From: Jeff Furlong @ 2017-12-04 19:17 UTC (permalink / raw) To: Jens Axboe, Gavriliuk, Anton (HPS Ukraine), Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu >That seems so odd, it makes no sense. --no-children added to perf report? >I think this is skewed because the code is so hot for startup. Perhaps you want to add --ramp_time=5s to remove startup jitter from host/device? Regards, Jeff ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-12-04 19:17 ` Jeff Furlong @ 2017-12-04 19:29 ` Gavriliuk, Anton (HPS Ukraine) 0 siblings, 0 replies; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 19:29 UTC (permalink / raw) To: Jeff Furlong, Jens Axboe, Robert Elliott <elliott@hpe.com>, Rebecca Cran (Persistent Memory), Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu Yes, it makes sence. I also run perf top 15-20 secs later after fio. Samples: 138K of event 'cycles:ppp', Event count (approx.): 95410433826 Overhead Shared Object Symbol 93.41% libc-2.22.so [.] __memcpy_avx_unaligned 2.36% fio [.] get_io_u 0.65% fio [.] ramp_time_over 0.63% fio [.] io_u_sync_complete 0.34% fio [.] td_io_queue 0.33% [unknown] [k] 0x00007f53f8a82b06 0.33% fio [.] __fio_gettime __memcpy_avx_unaligned /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org It's a tremendous results. But I'm still thinking is there any chance for further I/O perf boost. Anton -----Original Message----- From: Jeff Furlong [mailto:jeff.furlong@wdc.com] Sent: Monday, December 04, 2017 9:18 PM To: Jens Axboe <axboe@kernel.dk>; Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 >That seems so odd, it makes no sense. --no-children added to perf report? >I think this is skewed because the code is so hot for startup. Perhaps you want to add --ramp_time=5s to remove startup jitter from host/device? Regards, Jeff ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine) 2017-12-04 17:30 ` Robert Elliott (Persistent Memory) @ 2017-12-04 18:41 ` Jens Axboe 1 sibling, 0 replies; 39+ messages in thread From: Jens Axboe @ 2017-12-04 18:41 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine), Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 12/04/2017 10:12 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > I have run next command several times, > > perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 > > each perf report shows me toppest clock_thread_fn(), It's the cpu clock calibration that fio does. Just ignore it, it's run before any of the actual work starts. Additionally, please stop top posting. Proper net etiquette is replying below the context you are responding to. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-30 21:23 ` Jens Axboe ` (2 preceding siblings ...) 2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:30 ` Gavriliuk, Anton (HPS Ukraine) 3 siblings, 0 replies; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:30 UTC (permalink / raw) To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu It's important, because fio shows 6.9us clat for reads and 13us for writes, long time..., access to the PMEM just a few hundred ns. So I need to understand where we spend time. read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(232GiB/127291msec) clat (nsec): min=163, max=16033k, avg=6827.63, stdev=9340.91 lat (nsec): min=188, max=16033k, avg=6860.92, stdev=9341.21 write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(4408GiB/127291msec) clat (nsec): min=82, max=48028k, avg=12479.56, stdev=13459.80 lat (nsec): min=106, max=48028k, avg=12524.94, stdev=13464.29 -----Original Message----- From: Gavriliuk, Anton (HPS Ukraine) Sent: Monday, December 04, 2017 7:12 PM To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; 'Rebecca Cran' <rebecca@bluestop.org>; 'Sitsofe Wheeler' <sitsofe@gmail.com> Cc: 'fio@vger.kernel.org' <fio@vger.kernel.org>; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 I have run next command several times, perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 each perf report shows me toppest clock_thread_fn(), Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235 Overhead Command Shared Object Symbol 75.27% fio fio [.] clock_thread_fn 13.68% fio libc-2.22.so [.] __memcpy_avx_unaligned 10.05% fio fio [.] fill_random_buf Samples: 8M of event 'cycles:ppp', Event count (approx.): 28889015047191 Overhead Command Shared Object Symbol 79.97% fio fio [.] clock_thread_fn 11.08% fio libc-2.22.so [.] __memcpy_avx_unaligned 8.14% fio fio [.] fill_random_buf 0.13% fio fio [.] get_io_u Samples: 8M of event 'cycles:ppp', Event count (approx.): 57473976919058 Overhead Command Shared Object Symbol 89.94% fio fio [.] clock_thread_fn 5.57% fio libc-2.22.so [.] __memcpy_avx_unaligned 4.09% fio fio [.] fill_random_buf 0.07% fio fio [.] get_io_u Samples: 8M of event 'cycles:ppp', Event count (approx.): 160828042961056 Overhead Command Shared Object Symbol ◆ 92.30% fio fio [.] clock_thread_fn ▒ 4.10% fio [kernel.kallsyms] [k] filemap_map_pages ▒ 1.99% fio libc-2.22.so [.] __memcpy_avx_unaligned ▒ 1.46% fio fio [.] fill_random_buf ▒ 0.02% fio fio [.] get_io_u ▒ 0.02% fio fio [.] io_u_sync_complete Samples: 8M of event 'cycles:ppp', Event count (approx.): 163028870650810 Overhead Command Shared Object Symbol 87.68% fio fio [.] clock_thread_fn 8.77% fio [kernel.kallsyms] [k] run_timer_softirq 1.96% fio libc-2.22.so [.] __memcpy_avx_unaligned 1.44% fio fio [.] fill_random_buf 0.02% fio fio [.] get_io_u 0.02% fio fio [.] io_u_sync_complete 89.62 │ lock cmpxchg %edi,(%rsi) ▒ │ clock_thread_fn(): ▒ │ seq = *t->seq; ▒ │ if (seq == UINT_MAX) ▒ │ break; ▒ │ __sync_synchronize(); ▒ │ tsc = get_cpu_clock(); ▒ │ } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1)); ▒ │ cmp %eax,%ecx ▒ │ ↓ je 118 ◆ │ uint32_t seq; ▒ │ uint64_t tsc; ▒ │ ▒ │ c->cpu = t->cpu; ▒ │ do { ▒ │ seq = *t->seq; ▒ │ c9: mov 0x68(%rbx),%rcx ▒ 10.01 │ mov (%rcx),%ecx May you please clarify, is it ok ?, what is the lock contention... Anton ▒ -----Original Message----- From: Gavriliuk, Anton (HPS Ukraine) Sent: Sunday, December 03, 2017 5:24 AM To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: RE: fio 3.2 Ok, thanks. Now I built latest fio available :-) dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... fio-3.2-64-gaf7fb Starting 16 processes Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s] From perf top, Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613 Overhead Shared Object Symbol ◆ 93.34% libc-2.22.so [.] __memcpy_avx_unaligned Just can't drill it down, __memcpy_avx_unaligned /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org While other functions can, get_io_u /usr/local/bin/fio 0.08 │ xor %eax,%esi 0.01 │ shl $0x6,%rsi 0.02 │ and $0x1c0,%esi 0.00 │ sub %rax,%rsi 0.12 │ add %rdx,%rsi 0.00 │ sub %rax,%rdx │ nop │ /* │ * Fill offset into start of cacheline, time into end │ * of cacheline │ */ │ iptr = (void *) p + (offset << 6); │ *iptr = boffset; 16.01 │ 230: mov %rax,(%rsi,%rax,1) │ │ iptr = (void *) p + 64 - 2 * sizeof(uint64_t); │ iptr[0] = io_u->start_time.tv_sec; 7.03 │ mov 0x0(%rbp),%rcx 13.52 │ mov %rcx,0x30(%rdx,%rax,1) -----Original Message----- From: Jens Axboe [mailto:axboe@kernel.dk] Sent: Thursday, November 30, 2017 11:24 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com> Subject: Re: fio 3.2 On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote: > It's there any chance to fix it ? > > dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 > --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap > --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 > --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting > --name=4-rand-rw-3xx --size=290g > 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) > 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ... > fio-2.99 > Starting 16 processes > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 > > 4-rand-rw-3xx: failed to get file size of /dev/dax0.0 Should already be fixed as of yesterday morning. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-30 4:21 ` Jens Axboe 2017-11-30 14:17 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-30 16:13 ` Jens Axboe 2017-11-30 16:18 ` Jens Axboe 1 sibling, 1 reply; 39+ messages in thread From: Jens Axboe @ 2017-11-30 16:13 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 11/29/2017 09:21 PM, Jens Axboe wrote: > On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote: >> small_content_scramble has hardly been touched since 2011, so it probably >> hasn't had much performance analysis. > > That's fair, would be a good thing to look at, especially since it's on > by default. Something like this might be an improvement. The main change here is that for each 512b chunk in the io_u buffer, we generate a "random" index between 0..7 and scramble the start and end of that with the offset and time. The difference here is that we do it all within a 64b range within each chunk, which should fall into the same cacheline since the io_u buffer is generally aligned. This cuts down on the number of cachelines we dirty for each io_u, from a max of 16 to 8. We could further reduce this to 7, if we generate an overlapping cacheline between chunks. That would place the data in the same spot everytime, which isn't ideal though. I ran a quick null benchmark with this, and on my laptop it brings us from 4061k/4068k to 4090k/4091k. Those are results from two runs, so not very conclusive or definitive... Suggestions and tests would be welcome. FWIW, this is what I ran: ./fio --name=null --size=100g --rw=write --ioengine=null --gtod_reduce=1 --scramble_buffers=1 --iodepth=64 --direct=1 --cpus_allowed=2 diff --git a/io_u.c b/io_u.c index 086384a1c655..6bb9eabf1cb2 100644 --- a/io_u.c +++ b/io_u.c @@ -1669,32 +1669,40 @@ static bool check_get_verify(struct thread_data *td, struct io_u *io_u) */ static void small_content_scramble(struct io_u *io_u) { - unsigned int i, nr_blocks = io_u->buflen / 512; + unsigned int i, nr_blocks = io_u->buflen >> 9; unsigned int offset; - uint64_t boffset; - char *p, *end; + uint64_t boffset, *iptr; + char *p; if (!nr_blocks) return; p = io_u->xfer_buf; boffset = io_u->offset; - io_u->buf_filled_len = 0; + + if (io_u->buf_filled_len) + io_u->buf_filled_len = 0; + + /* + * Generate random index between 0..7. We do chunks of 512b, if + * we assume a cacheline is 64 bytes, then we have 8 of those. + * Scramble content within the blocks in the same cacheline to + * speed things up. + */ + offset = (io_u->start_time.tv_nsec ^ boffset) & 7; for (i = 0; i < nr_blocks; i++) { /* - * Fill the byte offset into a "random" start offset of - * the first half of the buffer. + * Fill offset into start of cacheline, time into end + * of cacheline */ - offset = (io_u->start_time.tv_nsec ^ boffset) & 255; - offset &= ~(sizeof(boffset) - 1); - memcpy(p + offset, &boffset, sizeof(boffset)); + iptr = (void *) p + (offset << 6); + *iptr = boffset; + + iptr = (void *) p + 64 - 2 * sizeof(uint64_t); + iptr[0] = io_u->start_time.tv_sec; + iptr[1] = io_u->start_time.tv_nsec; - /* - * Fill the start time into the end of the buffer - */ - end = p + 512 - sizeof(io_u->start_time); - memcpy(end, &io_u->start_time, sizeof(io_u->start_time)); p += 512; boffset += 512; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-30 16:13 ` Jens Axboe @ 2017-11-30 16:18 ` Jens Axboe 0 siblings, 0 replies; 39+ messages in thread From: Jens Axboe @ 2017-11-30 16:18 UTC (permalink / raw) To: Elliott, Robert (Persistent Memory), Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler Cc: fio@vger.kernel.org, Kani, Toshimitsu On 11/30/2017 09:13 AM, Jens Axboe wrote: > On 11/29/2017 09:21 PM, Jens Axboe wrote: >> On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote: >>> small_content_scramble has hardly been touched since 2011, so it probably >>> hasn't had much performance analysis. >> >> That's fair, would be a good thing to look at, especially since it's on >> by default. > > Something like this might be an improvement. The main change here is > that for each 512b chunk in the io_u buffer, we generate a "random" > index between 0..7 and scramble the start and end of that with the > offset and time. The difference here is that we do it all within a 64b > range within each chunk, which should fall into the same cacheline since > the io_u buffer is generally aligned. This cuts down on the number of > cachelines we dirty for each io_u, from a max of 16 to 8. > > We could further reduce this to 7, if we generate an overlapping > cacheline between chunks. That would place the data in the same spot > everytime, which isn't ideal though. > > I ran a quick null benchmark with this, and on my laptop it brings us > from 4061k/4068k to 4090k/4091k. Those are results from two runs, so not > very conclusive or definitive... Suggestions and tests would be welcome. > FWIW, this is what I ran: > > ./fio --name=null --size=100g --rw=write --ioengine=null --gtod_reduce=1 --scramble_buffers=1 --iodepth=64 --direct=1 --cpus_allowed=2 Ran the same thing on a server box, which generated very reliable results. It takes us from 3200K to 3340K. I'm going to commit this, but don't let that stop anyone from testing it and making suggestions on how we can further improve it. -- Jens Axboe ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>]
* Re: fio 3.2 [not found] ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org> @ 2017-11-28 8:38 ` Sitsofe Wheeler 0 siblings, 0 replies; 39+ messages in thread From: Sitsofe Wheeler @ 2017-11-28 8:38 UTC (permalink / raw) To: Rebecca Cran Cc: Gavriliuk, Anton (HPS Ukraine), fio@vger.kernel.org, Robert Elliott Thanks Rebecca. On 27 November 2017 at 19:57, Rebecca Cran <rebecca@bluestop.org> wrote: > If you don’t have a pmem device, Linux can emulate one using RAM and the > ‘memmap’ kernel parameter. See http://pmem.io/2016/02/22/pm-emulation.html > for more info. > > Rebecca > > Sent from my iPhone > > On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > Unfortunately I don't have access to a pmem device but let's see how far we > get: -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: fio 3.2 2017-11-27 19:38 ` Sitsofe Wheeler 2017-11-27 22:48 ` Rebecca Cran [not found] ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org> @ 2017-11-28 14:25 ` Gavriliuk, Anton (HPS Ukraine) 2017-11-29 0:04 ` Sitsofe Wheeler 2 siblings, 1 reply; 39+ messages in thread From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-28 14:25 UTC (permalink / raw) To: Sitsofe Wheeler Cc: fio@vger.kernel.org, Brian Boylston, Elliott, Robert (Persistent Memory) Ok, now I fixed that. Seems to be sequence of the keys is important. I fixed that. Now I have next issue :-) I trying to speed up customer’s script. Initially it was with /dev/pmem & libaio engine, dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --filename=/dev/pmem1 --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=libaio, iodepth=1 ... fio-2.12 Starting 16 processes Jobs: 2 (f=2): [_(5),m(1),_(7),m(1),_(2)] [100.0% done] [357.4MB/6990MB/0KB /s] [7066/137K/0 iops] [eta 00m:00s] 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=23144: Mon Nov 27 19:56:04 2017 read : io=245977MB, bw=1592.8MB/s, iops=31315, runt=154436msec slat (usec): min=1, max=77, avg=10.34, stdev= 6.00 clat (usec): min=0, max=6, avg= 0.20, stdev= 0.40 lat (usec): min=1, max=77, avg=10.57, stdev= 6.01 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0], | 70.00th=[ 0], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 1] bw (KB /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81 write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec slat (usec): min=1, max=787, avg=16.46, stdev=10.00 clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41 lat (usec): min=1, max=789, avg=16.69, stdev=10.00 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0], | 70.00th=[ 0], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 1] bw (MB /s): min= 731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41 lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 250=0.01%, 750=0.01% cpu : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec Now I want saving customer’s I/O pattern, but switch to the DAX access (/dev/dax0.0 raw device) with mmap engine, trying to bypass OS block device & system call layers. So I modified script, dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1 ... fio-2.12 Starting 16 processes Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s] 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017 read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23 lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 2], 20.00th=[ 4], | 30.00th=[ 6], 40.00th=[ 8], 50.00th=[ 10], 60.00th=[ 11], | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14], | 99.00th=[ 29], 99.50th=[ 43], 99.90th=[ 48], 99.95th=[ 49], | 99.99th=[ 53] bw (KB /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13 write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec clat (usec): min=0, max=744, avg=15.16, stdev=10.09 lat (usec): min=0, max=744, avg=15.21, stdev=10.09 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 3], 20.00th=[ 6], | 30.00th=[ 11], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 20], | 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 23], 95.00th=[ 24], | 99.00th=[ 46], 99.50th=[ 86], 99.90th=[ 91], 99.95th=[ 92], | 99.99th=[ 96] bw (MB /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05 lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58% lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01% cpu : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec No huge growth detected – lat 16.69 vs 15.21, write bw – 30234Mb/s vs 32121Mb/s. Looking to the performance traces, PID 65043 /usr/bin/fio HARDCLOCK entries Count Pct State Function 2210 66.29% USER __memcpy_avx_unaligned [/lib64/libc-2.22.so] 1122 33.65% USER UNKNOWN 2 0.06% SYS pagerange_is_ram_callback Count Pct HARDCLOCK Stack trace ============================================================ 2 0.06% pagerange_is_ram_callback walk_system_ram_range pat_pagerange_is_ram lookup_memtype track_pfn_insert vmf_insert_pfn_pmd dax_dev_pmd_fault handle_mm_fault __do_page_fault do_page_fault page_fault unknown | __memcpy_avx_unaligned 25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 Seems to be there lot's unaligned memory access, any ideas ? You could also look at the source code for fio to see what it’s doing or put in some printf() statements so you can understand what the actual alignment is. Anton -----Original Message----- From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] Sent: Monday, November 27, 2017 9:38 PM To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> Cc: fio@vger.kernel.org; Brian Boylston <brian.boylston@hpe.com>; Elliott, Robert (Persistent Memory) <elliott@hpe.com> Subject: Re: fio 3.2 Unfortunately I don't have access to a pmem device but let's see how far we get: On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > > result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 > --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 > --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep > WRITE) Please make your problem scenarios as simple as possible: 1. Just run fio normally so we can see the output it produces on both stdout and stderr normally. 2. Reduce the job back so it there are only the bare minimum options that reproduce the problem. 3. Try to avoid changing lots of things. Here you've switched ioengine introducing another place to look. Instead how about this: Was your /dev/dax0.0 device made using ndctl? Assuming yes: fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g (apparently a size has to be specified when you try to use a character device - see https://nvdimm.wiki.kernel.org/ ) If you run just that by itself what do you see? Next up: fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g If you run just that by itself what do you see? Finally: Assuming a -o dax mounted filesystem on /pmem0/ : fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/pmem0/fio-test,4096,1024 --thread=1 If you run just that by itself what do you see? (Perhaps the documentation for these ioengines and pmem devices needs to be improved?) -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: fio 3.2 2017-11-28 14:25 ` Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 0:04 ` Sitsofe Wheeler 0 siblings, 0 replies; 39+ messages in thread From: Sitsofe Wheeler @ 2017-11-29 0:04 UTC (permalink / raw) To: Gavriliuk, Anton (HPS Ukraine) Cc: fio@vger.kernel.org, Elliott, Robert (Persistent Memory) Anton, Did you see Elliot's suggestions (i.e. point 4) in his previous email (http://www.spinics.net/lists/fio/msg06561.html ) for speeding things up and how you are at the mercy of glibc? On 28 November 2017 at 14:25, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote: > Ok, now I fixed that. Seems to be sequence of the keys is important. I fixed that. Now I have next issue :-) > > I trying to speed up customer’s script. [snip] > lat (usec): min=1, max=77, avg=10.57, stdev= 6.01 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], > | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0], > | 70.00th=[ 0], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], > | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], > | 99.99th=[ 1] > bw (KB /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81 > write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec > slat (usec): min=1, max=787, avg=16.46, stdev=10.00 > clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41 > lat (usec): min=1, max=789, avg=16.69, stdev=10.00 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], > | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0], > | 70.00th=[ 0], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], > | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], > | 99.99th=[ 1] > bw (MB /s): min= 731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41 > lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% > lat (usec) : 250=0.01%, 750=0.01% > cpu : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec > WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec > > Now I want saving customer’s I/O pattern, but switch to the DAX access (/dev/dax0.0 raw device) with mmap engine, trying to bypass OS block device & system call layers. So I modified script, > > dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 > 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1 > ... > fio-2.12 > Starting 16 processes > Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s] > 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017 > read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec > clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23 > lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23 > clat percentiles (usec): > | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 2], 20.00th=[ 4], > | 30.00th=[ 6], 40.00th=[ 8], 50.00th=[ 10], 60.00th=[ 11], > | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14], > | 99.00th=[ 29], 99.50th=[ 43], 99.90th=[ 48], 99.95th=[ 49], > | 99.99th=[ 53] > bw (KB /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13 > write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec > clat (usec): min=0, max=744, avg=15.16, stdev=10.09 > lat (usec): min=0, max=744, avg=15.21, stdev=10.09 > clat percentiles (usec): > | 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 3], 20.00th=[ 6], > | 30.00th=[ 11], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 20], > | 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 23], 95.00th=[ 24], > | 99.00th=[ 46], 99.50th=[ 86], 99.90th=[ 91], 99.95th=[ 92], > | 99.99th=[ 96] > bw (MB /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05 > lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58% > lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01% > cpu : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec > WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec > > No huge growth detected – lat 16.69 vs 15.21, write bw – 30234Mb/s vs 32121Mb/s. > > Looking to the performance traces, > > PID 65043 /usr/bin/fio > > HARDCLOCK entries > Count Pct State Function > 2210 66.29% USER __memcpy_avx_unaligned [/lib64/libc-2.22.so] > 1122 33.65% USER UNKNOWN > 2 0.06% SYS pagerange_is_ram_callback > > Count Pct HARDCLOCK Stack trace > ============================================================ > 2 0.06% pagerange_is_ram_callback walk_system_ram_range pat_pagerange_is_ram lookup_memtype track_pfn_insert vmf_insert_pfn_pmd dax_dev_pmd_fault handle_mm_fault __do_page_fault do_page_fault page_fault unknown | __memcpy_avx_unaligned > > 25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > 25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6 > > Seems to be there lot's unaligned memory access, any ideas ? > > You could also look at the source code for fio to see what it’s doing or put in some printf() statements so you can understand what the actual alignment is. > > Anton -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2017-12-05 20:29 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <TU4PR8401MB099112614914A9232A07490DEC270@TU4PR8401MB0991.NAMPRD84.PROD.OUTLOOK.COM>
[not found] ` <903d418b-bac4-104b-28e5-3c529efab7f5@kernel.dk>
2017-11-26 5:30 ` fio 3.2 Gavriliuk, Anton (HPS Ukraine)
2017-11-26 15:28 ` Sitsofe Wheeler
2017-11-27 12:39 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-27 19:38 ` Sitsofe Wheeler
2017-11-27 22:48 ` Rebecca Cran
2017-11-28 1:12 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-28 4:51 ` Elliott, Robert (Persistent Memory)
2017-11-29 3:35 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 4:13 ` Elliott, Robert (Persistent Memory)
2017-11-29 4:44 ` Rebecca Cran
2017-11-30 4:04 ` Jens Axboe
2017-11-29 6:55 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 7:40 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 8:22 ` Elliott, Robert (Persistent Memory)
2017-11-29 13:12 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 17:17 ` Sitsofe Wheeler
2017-11-29 20:24 ` Elliott, Robert (Persistent Memory)
2017-11-30 4:21 ` Jens Axboe
2017-11-30 14:17 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-30 21:23 ` Jens Axboe
2017-12-01 5:21 ` Gavriliuk, Anton (HPS Ukraine)
2017-12-01 7:15 ` Robert Elliott (Persistent Memory)
2017-12-03 9:35 ` Sitsofe Wheeler
2017-12-03 17:10 ` Jens Axboe
2017-12-05 20:29 ` Elliott, Robert (Persistent Memory)
2017-12-03 3:24 ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 17:12 ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 17:30 ` Robert Elliott (Persistent Memory)
2017-12-04 18:25 ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 18:50 ` Jens Axboe
2017-12-04 19:17 ` Jeff Furlong
2017-12-04 19:29 ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 18:41 ` Jens Axboe
2017-12-04 17:30 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-30 16:13 ` Jens Axboe
2017-11-30 16:18 ` Jens Axboe
[not found] ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>
2017-11-28 8:38 ` Sitsofe Wheeler
2017-11-28 14:25 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 0:04 ` Sitsofe Wheeler
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.