RE: fio 3.2

All of lore.kernel.org
 help / color / mirror / Atom feed

* RE: fio 3.2
       [not found] ` <903d418b-bac4-104b-28e5-3c529efab7f5@kernel.dk>
@ 2017-11-26  5:30   ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-26 15:28     ` Sitsofe Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-26  5:30 UTC (permalink / raw)
  To: fio@vger.kernel.org

OK, I rebuild 3.2 with libaio & libpmem libraries,

dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem
libpmemblk-1.3-3.fc27.x86_64
libpmempool-devel-1.3-3.fc27.x86_64
libpmemlog-1.3-3.fc27.x86_64
libpmem-devel-1.3-3.fc27.x86_64
libpmemobj-devel-1.3-3.fc27.x86_64
libpmem-1.3-3.fc27.x86_64
libpmemobj-1.3-3.fc27.x86_64
libpmemblk-devel-1.3-3.fc27.x86_64
libpmempool-1.3-3.fc27.x86_64
libpmemlog-devel-1.3-3.fc27.x86_64

dl560g10spmem01:/var/work/fio-3.2 # ./configure
Operating system              Linux
CPU                           x86_64
Big endian                    no
Compiler                      gcc
Cross compile                 no
Static build                  no
Wordsize                      64
zlib                          yes
Linux AIO support             yes
POSIX AIO support             yes
POSIX AIO support needs -lrt  yes
POSIX AIO fsync               yes
POSIX pshared support         yes
Solaris AIO support           no
__sync_fetch_and_add          yes
__sync_synchronize            yes
__sync_val_compare_and_swap   yes
libverbs                      no
rdmacm                        no
Linux fallocate               yes
POSIX fadvise                 yes
POSIX fallocate               yes
sched_setaffinity(3 arg)      yes
sched_setaffinity(2 arg)      no
clock_gettime                 yes
CLOCK_MONOTONIC               yes
CLOCK_MONOTONIC_RAW           yes
CLOCK_MONOTONIC_PRECISE       no
clockid_t                     yes
gettimeofday                  yes
fdatasync                     yes
sync_file_range               yes
EXT4 move extent              yes
Linux splice(2)               yes
GUASI                         no
Fusion-io atomic engine       no
libnuma                       yes
libnuma v2                    yes
strsep                        yes
strcasestr                    yes
strlcat                       no
getopt_long_only()            yes
inet_aton                     yes
socklen_t                     yes
__thread                      yes
RUSAGE_THREAD                 yes
SCHED_IDLE                    yes
TCP_NODELAY                   yes
Net engine window_size        yes
TCP_MAXSEG                    yes
RLIMIT_MEMLOCK                yes
pwritev/preadv                yes
pwritev2/preadv2              no
IPv6 helpers                  yes
Rados Block Device engine     no
rbd blkin tracing             no
setvbuf                       yes
Gluster API engine            no
s390_z196_facilities          no
HDFS engine                   no
MTD                           yes
libpmem                       yes
libpmemblk                    yes
NVML pmemblk engine           yes
NVML dev-dax engine           yes
lex/yacc for arithmetic       no
getmntent                     yes
getmntinfo                    no
Static Assert                 yes
bool                          yes
strndup                       yes
march_armv8_a_crc_crypto      no
cuda                          no
mkdir(a, b)                   yes
dl560g10spmem01:/var/work/fio-3.2 # make
    CC crc/crc16.o
    CC crc/crc32.o
    CC crc/crc32c-arm64.o
    CC crc/crc32c-intel.o
    CC crc/crc32c.o
    CC crc/crc64.o
    CC crc/crc7.o
    CC crc/fnv.o
    CC crc/md5.o
    CC crc/murmur3.o
    CC crc/sha1.o
    CC crc/sha256.o
    CC crc/sha3.o
    CC crc/sha512.o
    CC crc/test.o
    CC crc/xxhash.o
    CC lib/axmap.o
    CC lib/bloom.o
    CC lib/flist_sort.o
    CC lib/gauss.o
    CC lib/getrusage.o
    CC lib/hweight.o
    CC lib/ieee754.o
    CC lib/lfsr.o
    CC lib/memalign.o
    CC lib/mountcheck.o
    CC lib/num2str.o
    CC lib/output_buffer.o
    CC lib/pattern.o
    CC lib/prio_tree.o
    CC lib/rand.o
    CC lib/rbtree.o
    CC lib/strntol.o
    CC lib/zipf.o
    CC gettime.o
    CC ioengines.o
    CC init.o
    CC stat.o
    CC log.o
    CC time.o
    CC filesetup.o
    CC eta.o
    CC verify.o
    CC memory.o
    CC io_u.o
    CC parse.o
    CC mutex.o
    CC options.o
    CC smalloc.o
    CC filehash.o
    CC profile.o
    CC debug.o
    CC engines/cpu.o
    CC engines/mmap.o
    CC engines/sync.o
    CC engines/null.o
    CC engines/net.o
    CC engines/ftruncate.o
    CC engines/filecreate.o
    CC server.o
    CC client.o
    CC iolog.o
    CC backend.o
    CC libfio.o
    CC flow.o
    CC cconv.o
    CC gettime-thread.o
    CC helpers.o
    CC json.o
    CC idletime.o
    CC td_error.o
    CC profiles/tiobench.o
    CC profiles/act.o
    CC io_u_queue.o
    CC filelock.o
    CC workqueue.o
    CC rate-submit.o
    CC optgroup.o
    CC helper_thread.o
    CC steadystate.o
    CC engines/libaio.o
    CC engines/posixaio.o
    CC engines/falloc.o
    CC engines/e4defrag.o
    CC engines/splice.o
    CC oslib/strlcat.o
    CC engines/mtd.o
    CC oslib/libmtd.o
    CC oslib/libmtd_legacy.o
    CC engines/pmemblk.o
    CC engines/dev-dax.o
    CC diskutil.o
    CC fifo.o
    CC blktrace.o
    CC cgroup.o
    CC trim.o
    CC engines/sg.o
    CC engines/binject.o
    CC oslib/linux-dev-lookup.o
  LINK fio
  LINK t/fio-genzipf
  LINK t/fio-btrace2fio
  LINK t/fio-dedupe
  LINK t/fio-verify-state
  LINK t/stest
  LINK t/ieee754
  LINK t/axmap
  LINK t/lfsr-test
  LINK t/gen-rand
dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --enghelp
Available IO engines:
        cpuio
        mmap
        sync
        psync
        vsync
        pvsync
        pvsync2
        null
        net
        netsplice
        ftruncate
        filecreate
        libaio
        posixaio
        falloc
        e4defrag
        splice
        mtd
        pmemblk
        dev-dax
        sg
        binject
dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=dev-dax --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
fio-3.2
Starting 2 processes

Run status group 0 (all jobs):
dl560g10spmem01:/var/work/fio-3.2 #


/dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota)
dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=pmemblk --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1
fio-3.2
Starting 1 thread
pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid argument)

Run status group 0 (all jobs):
dl560g10spmem01:/var/work/fio-3.2 #


dl560g10spmem01:/var/work/fio-3.2 # ll /dev/dax0.0
-rw-r--r-- 1 root root 322122547200 Nov 25 22:16 /dev/dax0.0
dl560g10spmem01:/var/work/fio-3.2 # ll /dev/pmem1
brw-rw---- 1 root disk 259, 3 Nov 25 21:51 /dev/pmem1

No error messages in dmesg/messages.

And the libpmem engine still not available.

Anton

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Saturday, November 25, 2017 7:40 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>
Subject: Re: fio 3.2

See the BUILDING section of the README. You are missing libaio-dev(el) and ditto for pmem. When you run configure, it shows you what it detects and what it doesn't.

In the future, please send support questions to the reflector, as also described in the documentation, fio@vger.kernel.org. I can't cater to all personal support requests, and there are plenty of other folks on that list that can help you out, especially on basic questions like this.


On 11/25/2017 08:50 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> In 3.2 version libaio engine also missed., can't be loaded
> 
> ï¿½
> 
> Anton
> 
> ï¿½
> 
> *From:* Gavriliuk, Anton (HPS Ukraine)
> *Sent:* Saturday, November 25, 2017 9:12 AM
> *To:* 'jens.axboe@oracle.com' <jens.axboe@oracle.com>; 
> 'jaxboe@fusionio.com' <jaxboe@fusionio.com>; 'axboe@fb.com' 
> <axboe@fb.com>
> *Subject:* fio 3.2
> 
> ï¿½
> 
> Hello Jens
> 
> ï¿½
> 
> I'm testing persistent memory using latest fio 3.2
> 
> ï¿½
> 
> http://fio.readthedocs.io/en/latest/fio_doc.html#building 
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__fio.readthedocs.i
> o_en_latest_fio-5Fdoc.html-23building&d=DwMFAg&c=5VD0RTtNlTh3ycd41b3MU
> w&r=cK1a7KivzZRh1fKQMjSm2A&m=gF-hXqu35gNKL7tvHVTPbXTfipn8kMan45lnr9z1C
> kA&s=9wSt7cGFX26GOxsXLnQ6pALYuxas7yjYsYsV-AWIMhA&e=>
> 
> ï¿½
> 
> 1.12.10. I/O engine
> 
> pmemblk
> 
> Read and write using filesystem DAX to a file on a filesystem mounted with DAX on a persistent memory device through the NVML libpmemblk library.
> 
> dev-dax
> 
> Read and write using device DAX to a persistent memory device (e.g., /dev/dax0.0) through the NVML libpmem library.
> 
> libpmem
> 
> Read and write using mmap I/O to a file on a filesystem mounted with DAX on a persistent memory device through the NVML libpmem library.
> 
> ï¿½
> 
> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --version
> 
> fio-3.2
> 
> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --enghelp
> 
> Available IO engines:
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ cpuio
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ mmap
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ sync
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ psync
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ vsync
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ pvsync
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ pvsync2
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ null
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ net
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ netsplice
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ ï¿½ï¿½ftruncate
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ filecreate
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ posixaio
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ falloc
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ e4defrag
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ splice
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ mtd
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ sg
> 
> ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ binject
> 
> dl560g10spmem01:/var/work/fio-3.2 # ll /dev|grep dax
> 
> crw-------ï¿½ 1 root rootï¿½ï¿½ï¿½ 243,ï¿½ï¿½ 0 Nov 25 00:12 dax0.0
> 
> dl560g10spmem01:/var/work/fio-3.2 # ll /dev|grep pmem
> 
> brw-rw----ï¿½ 1 root diskï¿½ï¿½ï¿½ 259,ï¿½ï¿½ 2 Nov 25 00:35 pmem1
> 
> brw-rw----ï¿½ 1 root diskï¿½ï¿½ï¿½ 259,ï¿½ï¿½ 0 Nov 25 00:12 pmem2
> 
> brw-rw----ï¿½ 1 root diskï¿½ï¿½ï¿½ 259,ï¿½ï¿½ 1 Nov 25 00:12 pmem3
> 
> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio 
> --filename=/dev/dax0.0ï¿½ --direct=1 --rw=randrw --refill_buffers 
> --norandommap --randrepeat=0 --ioengine=dev-dax 
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 
> --iodepth=1 --numjobs=16 --runtime=180 --group_reporting 
> --name=4-rand-rw-3xx
> 
> fio: engine dev-dax not loadable
> 
> fio: failed to load engine
> 
> dl560g10spmem01:/var/work/fio-3.2 #
> 
> ï¿½
> 
> I can't access to the 3 persistent memory engines above.
> 
> ï¿½
> 
> May you please help me with loading these engines.
> 
> ï¿½
> 
> ï¿½
> 
> *Anton Gavriliuk*
> 
> Critical Support
> 
> Performance Engineering
> 


--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-26  5:30   ` fio 3.2 Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-26 15:28     ` Sitsofe Wheeler
  2017-11-27 12:39       ` Gavriliuk, Anton (HPS Ukraine)
  0 siblings, 1 reply; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-11-26 15:28 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine); +Cc: fio@vger.kernel.org

On 26 November 2017 at 05:30, Gavriliuk, Anton (HPS Ukraine)
<anton.gavriliuk@hpe.com> wrote:
> OK, I rebuild 3.2 with libaio & libpmem libraries,
>
> dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem
> libpmemblk-1.3-3.fc27.x86_64
> libpmempool-devel-1.3-3.fc27.

[snip]

> libpmem                       yes
> libpmemblk                    yes
> NVML pmemblk engine           yes
> NVML dev-dax engine           yes

OK so fio has been built with pmem support.

[...]

> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=dev-dax --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
> fio-3.2
> Starting 2 processes
>
> Run status group 0 (all jobs):

Where's the output of this job? Did it work?

> dl560g10spmem01:/var/work/fio-3.2 #
>
> /dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota)
^^^ So your dax filesystem appears at /pmem ...

> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=pmemblk --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1
> fio-3.2
> Starting 1 thread
> pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid argument)

...but you're using filename=/dev/pmem1 with pmemblk? At least the
filename option you used looks incorrect for this ioengine - have you
seen the comments and filenames in the example over on
https://github.com/axboe/fio/blob/fio-3.2/examples/pmemblk.fio ? Also
are you sure you want to be setting allow_mounted_write=1 in this case
- what should be mounted where the I/O happens - won't pmemblk create
the "device" itself?

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-26 15:28     ` Sitsofe Wheeler
@ 2017-11-27 12:39       ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-27 19:38         ` Sitsofe Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-27 12:39 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio@vger.kernel.org

I'm trying to test PMEM using DAX/mmap(), but there're no outputs & errors,

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll /dev|grep pmem
brw-rw----  1 root disk    259,   0 Nov 27 05:50 pmem1
brw-rw----  1 root disk    259,   1 Nov 27 05:50 pmem2
brw-rw----  1 root disk    259,   2 Nov 27 05:50 pmem3
dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll /dev|grep dax
crw-------  1 root root    243,   0 Nov 27 05:50 dax0.0
dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ll
total 36
-rwxrwxrwx 1 root root 110 Nov 23 20:44 cleanupDAX.sh
-rwxrwxrwx 1 root root 381 Nov 27 05:55 run-fio-db-64krw.sh
-rwxr-xr-x 1 root root 361 Nov 23 20:51 run-fio-db-64ksw-numaopt.sh
-rwxrwxrwx 1 root root 361 Nov 23 20:47 run-fio-db-64ksw.sh
-rwxrwxrwx 1 root root 375 Nov 23 20:49 run-fio-db-8krr.sh
-rwxrwxrwx 1 root root 636 Nov 23 20:49 run-fio-db-8krrw.sh
-rwxrwxrwx 1 root root 339 Nov 23 20:45 run-itDAX.sh
-rwxrwxrwx 1 root root 132 Nov 23 20:46 setupDAX.sh
-rw-r--r-- 1 root root 117 Nov 23 20:16 testDAX.out
dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX #
dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # cat run-fio-db-64krw.sh
#! /bin/sh

echo "Running random write 64k 32 IO 1024m file 8 threads on DAX PMEM"
result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep WRITE)
agg=$(echo $result|cut -d',' -f2)
speed="${agg/aggrb=/Write Speed:}"
echo $speed
echo ""

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64krw.sh
Running random write 64k 32 IO 1024m file 8 threads on DAX PMEM










dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64ksw-numaopt.sh
Running seq write 64k 1 IO 3g file 4 threads on DAX PMEM






dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-64ksw.sh
Running seq write 64k 1 IO 3g file 4 threads on DAX PMEM






dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-8krr.sh
Running random read 8k 32 IO 1024m file 8 threads on DAX PMEM










dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # ./run-fio-db-8krrw.sh
Running random read/write 60/40 8k 32 IO 1024m file 8 threads on DAX PMEM











dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX #

-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] 
Sent: Sunday, November 26, 2017 5:28 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>
Cc: fio@vger.kernel.org
Subject: Re: fio 3.2

On 26 November 2017 at 05:30, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote:
> OK, I rebuild 3.2 with libaio & libpmem libraries,
>
> dl560g10spmem01:/var/work/fio-3.2 # rpm -qa|grep libpmem
> libpmemblk-1.3-3.fc27.x86_64
> libpmempool-devel-1.3-3.fc27.

[snip]

> libpmem                       yes
> libpmemblk                    yes
> NVML pmemblk engine           yes
> NVML dev-dax engine           yes

OK so fio has been built with pmem support.

[...]

> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio 
> --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap 
> --randrepeat=0 --ioengine=dev-dax 
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 
> --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=300Gb --name=4-rand-rw-3xx
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=dev-dax, iodepth=1
> fio-3.2
> Starting 2 processes
>
> Run status group 0 (all jobs):

Where's the output of this job? Did it work?

> dl560g10spmem01:/var/work/fio-3.2 #
>
> /dev/pmem1 on /pmem type xfs (rw,relatime,attr2,dax,inode64,noquota)
^^^ So your dax filesystem appears at /pmem ...

> dl560g10spmem01:/var/work/fio-3.2 # /usr/local/bin/fio 
> --filename=/dev/pmem1 --rw=randrw --refill_buffers --norandommap 
> --randrepeat=0 --ioengine=pmemblk 
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 
> --iodepth=1 --numjobs=1 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --allow_mounted_write=1 --thread=1 --direct=1
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=pmemblk, iodepth=1
> fio-3.2
> Starting 1 thread
> pmemblk: unable to open pmemblk pool file /dev/pmem1 (Invalid 
> argument)

...but you're using filename=/dev/pmem1 with pmemblk? At least the filename option you used looks incorrect for this ioengine - have you seen the comments and filenames in the example over on https://github.com/axboe/fio/blob/fio-3.2/examples/pmemblk.fio ? Also are you sure you want to be setting allow_mounted_write=1 in this case
- what should be mounted where the I/O happens - won't pmemblk create the "device" itself?

--
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-27 12:39       ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-27 19:38         ` Sitsofe Wheeler
  2017-11-27 22:48           ` Rebecca Cran
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-11-27 19:38 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine)
  Cc: fio@vger.kernel.org, Brian Boylston, Robert Elliott

Unfortunately I don't have access to a pmem device but let's see how far we get:

On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
<anton.gavriliuk@hpe.com> wrote:
>
> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep WRITE)

Please make your problem scenarios as simple as possible:
1. Just run fio normally so we can see the output it produces on both
stdout and stderr normally.
2. Reduce the job back so it there are only the bare minimum options
that reproduce the problem.
3. Try to avoid changing lots of things.

Here you've switched ioengine introducing another place to look.
Instead how about this:

Was your /dev/dax0.0 device made using ndctl? Assuming yes:

fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never
--time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

(apparently a size has to be specified when you try to use a character
device - see https://nvdimm.wiki.kernel.org/ )

If you run just that by itself what do you see?

Next up:

fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never
--time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

If you run just that by itself what do you see?

Finally:
Assuming a -o dax mounted filesystem on /pmem0/ :

fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k
--eta=never --time_based --runtime=60
--filename=/pmem0/fio-test,4096,1024 --thread=1

If you run just that by itself what do you see?

(Perhaps the documentation for these ioengines and pmem devices needs
to be improved?)

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-27 19:38         ` Sitsofe Wheeler
@ 2017-11-27 22:48           ` Rebecca Cran
  2017-11-28  1:12             ` Gavriliuk, Anton (HPS Ukraine)
       [not found]           ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>
  2017-11-28 14:25           ` Gavriliuk, Anton (HPS Ukraine)
  2 siblings, 1 reply; 39+ messages in thread
From: Rebecca Cran @ 2017-11-27 22:48 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Gavriliuk, Anton (HPS Ukraine), fio@vger.kernel.org,
	Brian Boylston, Robert Elliott

If you don’t have a pmem device, Linux can emulate one using RAM and the 
‘memmap’ kernel parameter. See 
http://pmem.io/2016/02/22/pm-emulation.html for more info.

Rebecca

Sent from my iPhone

On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com 
<mailto:sitsofe@gmail.com>> wrote:

> Unfortunately I don't have access to a pmem device but let's see how 
> far we get:
>
> On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote:
>>
>> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 
>> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 
>> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep 
>> WRITE)
>
> Please make your problem scenarios as simple as possible:
> 1. Just run fio normally so we can see the output it produces on both
> stdout and stderr normally.
> 2. Reduce the job back so it there are only the bare minimum options
> that reproduce the problem.
> 3. Try to avoid changing lots of things.
>
> Here you've switched ioengine introducing another place to look.
> Instead how about this:
>
> Was your /dev/dax0.0 device made using ndctl? Assuming yes:
>
> fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never
> --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g
>
> (apparently a size has to be specified when you try to use a character
> device - see https://nvdimm.wiki.kernel.org/ )
>
> If you run just that by itself what do you see?
>
> Next up:
>
> fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never
> --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g
>
> If you run just that by itself what do you see?
>
> Finally:
> Assuming a -o dax mounted filesystem on /pmem0/ :
>
> fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k
> --eta=never --time_based --runtime=60
> --filename=/pmem0/fio-test,4096,1024 --thread=1
>
> If you run just that by itself what do you see?
>
> (Perhaps the documentation for these ioengines and pmem devices needs
> to be improved?)
>
> -- 
> Sitsofe | http://sucs.org/~sits/ <http://sucs.org/%7Esits/>
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org 
> <mailto:majordomo@vger.kernel.org>
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-27 22:48           ` Rebecca Cran
@ 2017-11-28  1:12             ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-28  4:51               ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-28  1:12 UTC (permalink / raw)
  To: Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Brian Boylston,
	Elliott, Robert (Persistent Memory)

No, I have true 4 pmem devices right now, 300Gb each,

dl560g10spmem01:~ # ll /dev|grep dax
crw-------  1 root root    243,   0 Nov 27 05:50 dax0.0
dl560g10spmem01:~ # ll /dev|grep pmem
brw-rw----  1 root disk    259,   0 Nov 27 11:38 pmem1
brw-rw----  1 root disk    259,   1 Nov 27 12:39 pmem2
brw-rw----  1 root disk    259,   2 Nov 27 05:50 pmem3

Anton

-----Original Message-----
From: Rebecca Cran [mailto:rebecca@bluestop.org] 
Sent: Tuesday, November 28, 2017 12:48 AM
To: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; fio@vger.kernel.org; Brian Boylston <brian.boylston@hpe.com>; Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Subject: Re: fio 3.2

If you don’t have a pmem device, Linux can emulate one using RAM and the ‘memmap’ kernel parameter. See http://pmem.io/2016/02/22/pm-emulation.html for more info.

Rebecca

Sent from my iPhone

On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com <mailto:sitsofe@gmail.com>> wrote:

> Unfortunately I don't have access to a pmem device but let's see how 
> far we get:
>
> On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) 
> <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote:
>>
>> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 
>> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 
>> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep
>> WRITE)
>
> Please make your problem scenarios as simple as possible:
> 1. Just run fio normally so we can see the output it produces on both 
> stdout and stderr normally.
> 2. Reduce the job back so it there are only the bare minimum options 
> that reproduce the problem.
> 3. Try to avoid changing lots of things.
>
> Here you've switched ioengine introducing another place to look.
> Instead how about this:
>
> Was your /dev/dax0.0 device made using ndctl? Assuming yes:
>
> fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never 
> --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g
>
> (apparently a size has to be specified when you try to use a character 
> device - see https://nvdimm.wiki.kernel.org/ )
>
> If you run just that by itself what do you see?
>
> Next up:
>
> fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never 
> --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g
>
> If you run just that by itself what do you see?
>
> Finally:
> Assuming a -o dax mounted filesystem on /pmem0/ :
>
> fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k 
> --eta=never --time_based --runtime=60
> --filename=/pmem0/fio-test,4096,1024 --thread=1
>
> If you run just that by itself what do you see?
>
> (Perhaps the documentation for these ioengines and pmem devices needs 
> to be improved?)
>
> --
> Sitsofe | http://sucs.org/~sits/ <http://sucs.org/%7Esits/>
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in the 
> body of a message to majordomo@vger.kernel.org 
> <mailto:majordomo@vger.kernel.org>
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-28  1:12             ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-28  4:51               ` Elliott, Robert (Persistent Memory)
  2017-11-29  3:35                 ` Gavriliuk, Anton (HPS Ukraine)
  0 siblings, 1 reply; 39+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-11-28  4:51 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
> 
> No, I have true 4 pmem devices right now, 300Gb each,

I think that suggestion was for Sitsofe.


> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com
> <mailto:sitsofe@gmail.com>> wrote:
> 
> > Unfortunately I don't have access to a pmem device but let's see how
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1
> >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> >
...
> > (apparently a size has to be specified when you try to use a character
> > device - see https://nvdimm.wiki.kernel.org/ )
> >
...
> > (Perhaps the documentation for these ioengines and pmem devices needs
> > to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices,
providing a window to persistent memory representing the whole device) compared
to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access
files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create
a plain file called "/dev/dax0.0" on your regular storage device and do its
I/Os to that.


2. This script will convert 16 pmem devices into device DAX mode (trashing
all data in the process):

#!/bin/bash
for i in {0..15}
do
        echo working on $i
        # -M map puts struct page data in regular memory, not nvdimm memory
        # that's fine for NVDIMM-Ns; intended for larger capacities
        ndctl create-namespace -m dax -M mem -e namespace$i.0 -f
done

ndctl carves out some space for device DAX metadata. For 8 GiB of
persistent memory, the resulting character device sizes are:
    -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
    -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)

Since /dev/dax is a character device, it has no "size", so you must manually
tell it the size. We should patch fio to automatically detect that from
/sys/class/dax/dax0.0/size in linux.  I don't know how this would work in
Windows.


3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits
us to 1GiB of mapped files in total".  

I expect real software using device DAX will want to map the entire device
with one mmap() call, then perform loads, stores, cache flushes, etc. -
things that the NVML libraries help it do correctly.  

As is, I think fio keeps mapping and unmapping, which exercises the kernel
more than the hardware.


4. The CPU instructions used by fio depend on the glibc library version.
As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so,
random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce
fio overhead at these rates.

I don't think iodepth is used by the mmap ioengine.


5. If the blocksize * number of threads exceeds the CPU cache size, you'll start
generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 ends up reading
read 11 GB/s from persistent memory; the writes stay in CPU cache 
(no memory write traffic).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11022.33 --||-- Mem Ch  0: Reads (MB/s):    15.60 --|
|--            Writes(MB/s):    22.29 --||--            Writes(MB/s):     7.09 --|
|-- Mem Ch  1: Reads (MB/s):    56.72 --||-- Mem Ch  1: Reads (MB/s):     9.34 --|
|--            Writes(MB/s):    17.03 --||--            Writes(MB/s):     0.81 --|
|-- Mem Ch  4: Reads (MB/s):    60.12 --||-- Mem Ch  4: Reads (MB/s):    15.79 --|
|--            Writes(MB/s):    23.66 --||--            Writes(MB/s):     7.10 --|
|-- Mem Ch  5: Reads (MB/s):    54.94 --||-- Mem Ch  5: Reads (MB/s):     9.65 --|
|--            Writes(MB/s):    17.30 --||--            Writes(MB/s):     1.02 --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s):     192832 --||-- NODE 1 P. Write (T/s):     190615 --|
|-- NODE 0 Memory (MB/s):    11274.38 --||-- NODE 1 Memory (MB/s):       66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  11332.59                  --|
|--                  System Write Throughput(MB/s):     96.23                  --|
|--                 System Memory Throughput(MB/s):  11428.82                  --|
|---------------------------------------||---------------------------------------|


Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s,
most of those turn into writes to regular memory (with some reads as well as
the caches thrash).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  9069.96 --||-- Mem Ch  0: Reads (MB/s):    18.33 --|
|--            Writes(MB/s):  2026.71 --||--            Writes(MB/s):     8.44 --|
|-- Mem Ch  1: Reads (MB/s):  1070.84 --||-- Mem Ch  1: Reads (MB/s):    11.08 --|
|--            Writes(MB/s):  2021.61 --||--            Writes(MB/s):     2.11 --|
|-- Mem Ch  4: Reads (MB/s):  1069.20 --||-- Mem Ch  4: Reads (MB/s):    17.85 --|
|--            Writes(MB/s):  2028.27 --||--            Writes(MB/s):     8.51 --|
|-- Mem Ch  5: Reads (MB/s):  1062.81 --||-- Mem Ch  5: Reads (MB/s):    11.59 --|
|--            Writes(MB/s):  2021.74 --||--            Writes(MB/s):     2.31 --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s):     220528 --||-- NODE 1 P. Write (T/s):     190643 --|
|-- NODE 0 Memory (MB/s):    20371.14 --||-- NODE 1 Memory (MB/s):       80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  12331.67                  --|
|--                  System Write Throughput(MB/s):   8119.69                  --|
|--                 System Memory Throughput(MB/s):  20451.36                  --|
|---------------------------------------||---------------------------------------|


Example simple script:

[random-test]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
size=7GiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
time_based
runtime=60
allow_file_create=0
filename=/dev/dax0.0


Example more aggressive script:
[global]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
ba=64KiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
norandommap
randrepeat=0
zero_buffers
gtod_reduce=1
time_based
runtime=60999
allow_file_create=0

[d0]
size=7GiB
filename=/dev/dax0.0

[d1]
size=7GiB
filename=/dev/dax1.0

[d2]
size=7GiB
filename=/dev/dax2.0

[d3]
size=7GiB
filename=/dev/dax3.0

[d4]
size=7GiB
filename=/dev/dax4.0

[d5]
size=7GiB
filename=/dev/dax5.0

[d6]
size=7GiB
filename=/dev/dax6.0

[d7]
size=7GiB
filename=/dev/dax7.0




^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-28  4:51               ` Elliott, Robert (Persistent Memory)
@ 2017-11-29  3:35                 ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29  3:35 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Sorry Elliott, seems to be I missed your mail.

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.

Done, no visible improvements detected,

dl560g10spmem01:~ # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 14 (f=14): [m(10),_(1),m(4),_(1)] [78.1% done] [1569MB/29903MB/0KB /s] [30.1K/588K/0 iops] [eta 00m:34s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=18988: Tue Nov 28 21:30:12 2017
  read : io=204940MB, bw=1707.9MB/s, iops=33575, runt=120001msec
    clat (usec): min=0, max=1008, avg= 8.85, stdev= 5.25
     lat (usec): min=0, max=1008, avg= 8.89, stdev= 5.25
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
     | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
     | 99.99th=[   53]
    bw (KB  /s): min=68034, max=158528, per=6.25%, avg=109348.46, stdev=19939.95
  write: io=3798.1GB, bw=32417MB/s, iops=637449, runt=120001msec
    clat (usec): min=0, max=616, avg=15.13, stdev=10.08
     lat (usec): min=0, max=717, avg=15.17, stdev=10.08
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
     | 99.99th=[   96]
    bw (MB  /s): min= 1359, max= 2700, per=6.25%, avg=2026.98, stdev=363.84
    lat (usec) : 2=4.70%, 4=8.45%, 10=11.33%, 20=32.66%, 50=42.21%
    lat (usec) : 100=0.65%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%
  cpu          : usr=99.74%, sys=0.21%, ctx=2538, majf=0, minf=2378756
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=4029100/w=76494589/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=204940MB, aggrb=1707.9MB/s, minb=1707.9MB/s, maxb=1707.9MB/s, mint=120001msec, maxt=120001msec
  WRITE: io=3798.1GB, aggrb=32417MB/s, minb=32417MB/s, maxb=32417MB/s, mint=120001msec, maxt=120001msec


    HARDCLOCK entries
       Count     Pct  State  Function
        1955  64.97%  USER   __memcpy_avx_unaligned [/lib64/libc-2.22.so]
        1037  34.46%  USER   UNKNOWN
           6   0.20%  SYS    do_page_fault
           5   0.17%  SYS    find_next_iomem_res
           3   0.10%  SYS    pagerange_is_ram_callback
           1   0.03%  SYS    page_add_new_anon_rmap
           1   0.03%  SYS    lookup_memtype
           1   0.03%  SYS    vmf_insert_pfn_pmd

       Count     Pct  HARDCLOCK Stack trace
       ============================================================
           4   0.13%  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned
           3   0.10%  find_next_iomem_res  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned
           2   0.07%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned



3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total".  

I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly.  

As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

Yes, fio maps entire device by 1 mmap(),

    ******** SYSTEM CALL REPORT ********
    System Call Name     Count     Rate     ElpTime        Avg        Max    Errs    AvSz     KB/s
    futex                    7      0.2    0.101176   0.014454   0.101147       1
       SLEEP                 2      0.0    0.101148   0.050574
          Sleep Func         2             0.101148   0.050574   0.101140  futex_wait_queue_me
       RUNQ                                0.000004
       CPU                                 0.000024
    open                     2      0.0    0.000132   0.000066   0.000130       0
    read                     1      0.0    0.000009   0.000009   0.000009       0       3      0.0
    shmat                    1      0.0    0.000006   0.000006   0.000006       0
    mmap                     1      0.0    0.000003   0.000003   0.000003       0


4. The CPU instructions used by fio depend on the glibc library version.
As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't understand what does it mean - " random reads seem to be done with rep movsb " 


-----Original Message-----
From: Elliott, Robert (Persistent Memory) 
Sent: Tuesday, November 28, 2017 6:52 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
> 
> No, I have true 4 pmem devices right now, 300Gb each,

I think that suggestion was for Sitsofe.


> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com 
> <mailto:sitsofe@gmail.com>> wrote:
> 
> > Unfortunately I don't have access to a pmem device but let's see how 
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) 
> > <anton.gavriliuk@hpe.com <mailto:anton.gavriliuk@hpe.com>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 
> >> --group_reporting=1 --eta=never --time_based --runtime=60 
> >> --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> >
...
> > (apparently a size has to be specified when you try to use a 
> > character device - see https://nvdimm.wiki.kernel.org/ )
> >
...
> > (Perhaps the documentation for these ioengines and pmem devices 
> > needs to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.


2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process):

#!/bin/bash
for i in {0..15}
do
        echo working on $i
        # -M map puts struct page data in regular memory, not nvdimm memory
        # that's fine for NVDIMM-Ns; intended for larger capacities
        ndctl create-namespace -m dax -M mem -e namespace$i.0 -f done

ndctl carves out some space for device DAX metadata. For 8 GiB of persistent memory, the resulting character device sizes are:
    -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
    -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)

Since /dev/dax is a character device, it has no "size", so you must manually tell it the size. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in linux.  I don't know how this would work in Windows.


3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total".  

I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly.  

As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.


4. The CPU instructions used by fio depend on the glibc library version.
As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates.

I don't think iodepth is used by the mmap ioengine.


5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 ends up reading read 11 GB/s from persistent memory; the writes stay in CPU cache (no memory write traffic).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11022.33 --||-- Mem Ch  0: Reads (MB/s):    15.60 --|
|--            Writes(MB/s):    22.29 --||--            Writes(MB/s):     7.09 --|
|-- Mem Ch  1: Reads (MB/s):    56.72 --||-- Mem Ch  1: Reads (MB/s):     9.34 --|
|--            Writes(MB/s):    17.03 --||--            Writes(MB/s):     0.81 --|
|-- Mem Ch  4: Reads (MB/s):    60.12 --||-- Mem Ch  4: Reads (MB/s):    15.79 --|
|--            Writes(MB/s):    23.66 --||--            Writes(MB/s):     7.10 --|
|-- Mem Ch  5: Reads (MB/s):    54.94 --||-- Mem Ch  5: Reads (MB/s):     9.65 --|
|--            Writes(MB/s):    17.30 --||--            Writes(MB/s):     1.02 --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s):     192832 --||-- NODE 1 P. Write (T/s):     190615 --|
|-- NODE 0 Memory (MB/s):    11274.38 --||-- NODE 1 Memory (MB/s):       66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  11332.59                  --|
|--                  System Write Throughput(MB/s):     96.23                  --|
|--                 System Memory Throughput(MB/s):  11428.82                  --|
|---------------------------------------||---------------------------------------|


Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, most of those turn into writes to regular memory (with some reads as well as the caches thrash).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  9069.96 --||-- Mem Ch  0: Reads (MB/s):    18.33 --|
|--            Writes(MB/s):  2026.71 --||--            Writes(MB/s):     8.44 --|
|-- Mem Ch  1: Reads (MB/s):  1070.84 --||-- Mem Ch  1: Reads (MB/s):    11.08 --|
|--            Writes(MB/s):  2021.61 --||--            Writes(MB/s):     2.11 --|
|-- Mem Ch  4: Reads (MB/s):  1069.20 --||-- Mem Ch  4: Reads (MB/s):    17.85 --|
|--            Writes(MB/s):  2028.27 --||--            Writes(MB/s):     8.51 --|
|-- Mem Ch  5: Reads (MB/s):  1062.81 --||-- Mem Ch  5: Reads (MB/s):    11.59 --|
|--            Writes(MB/s):  2021.74 --||--            Writes(MB/s):     2.31 --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s):     220528 --||-- NODE 1 P. Write (T/s):     190643 --|
|-- NODE 0 Memory (MB/s):    20371.14 --||-- NODE 1 Memory (MB/s):       80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  12331.67                  --|
|--                  System Write Throughput(MB/s):   8119.69                  --|
|--                 System Memory Throughput(MB/s):  20451.36                  --|
|---------------------------------------||---------------------------------------|


Example simple script:

[random-test]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
size=7GiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
time_based
runtime=60
allow_file_create=0
filename=/dev/dax0.0


Example more aggressive script:
[global]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
ba=64KiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
norandommap
randrepeat=0
zero_buffers
gtod_reduce=1
time_based
runtime=60999
allow_file_create=0

[d0]
size=7GiB
filename=/dev/dax0.0

[d1]
size=7GiB
filename=/dev/dax1.0

[d2]
size=7GiB
filename=/dev/dax2.0

[d3]
size=7GiB
filename=/dev/dax3.0

[d4]
size=7GiB
filename=/dev/dax4.0

[d5]
size=7GiB
filename=/dev/dax5.0

[d6]
size=7GiB
filename=/dev/dax6.0

[d7]
size=7GiB
filename=/dev/dax7.0




^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29  3:35                 ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
  2017-11-29  4:44                     ` Rebecca Cran
                                       ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-11-29  4:13 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu


> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Tuesday, November 28, 2017 9:35 PM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran
> <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
> Subject: RE: fio 3.2


> 4. The CPU instructions used by fio depend on the glibc library version.
> As mentioned in an earlier fio thread, that changes a lot. With
> libc2.24.so, random reads seem to be done with rep movsb.
> 
> The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't
> understand what does it mean - " random reads seem to be done with rep
> movsb "

While the test is running, run
	perf top

and select one of the busiest functions (hopefully a memcpy function):

  90.14%  libc-2.24.so         [.] __memmove_avx_unaligned_erms
   3.85%  [unknown]            [k] 0x00007f7414f5c4ff
   0.53%  [kernel]             [k] wait_consider_task
   0.33%  fio                  [.] get_io_u
   0.29%  fio                  [.] td_io_queue
   0.26%  fio                  [.] io_u_sync_complete

Hit enter twice to get to the assembly language code.  This shows you
the amount of time the CPUs are spending in each instruction (based on
sampling):

  0.03 │ 2b:   cmp    __x86_shared_non_temporal_threshold,%rdx
  0.01 │     ↓ jae    15d
       │       cmp    %rsi,%rdi
  0.00 │     ↓ jb     4c
       │     ↓ je     51
       │       lea    (%rsi,%rdx,1),%r9
       │       cmp    %r9,%rdi
       │     ↓ jb     211
       │ 4c:   mov    %rdx,%rcx
 99.88 │       rep    movsb %ds:(%rsi),%es:(%rdi)
  0.08 │ 51: ← retq
       │ 52:   cmp    $0x10,%dl

That means 99.88% of 90.14% of the time is spend on rep movsb, presumably
reading from persistent memory (and writing to regular memory, but to
a buffer that's entirely in the CPU caches).

Some more examples of how to analyze the results:

If I remove the norandommap option, fio spends about 1% of the CPU time 
in a function maintaining the list of which LBAs it has read:
  89.63%  libc-2.24.so                [.] __memmove_avx_unaligned_erms
   3.28%  [unknown]                   [k] 0x00007f34da4824ff
   0.87%  fio                         [.] axmap_isset 
   0.58%  [kernel]                    [k] wait_consider_task
   0.38%  fio                         [.] get_io_u
   0.34%  fio                         [.] io_u_sync_complete


If I switch to random writes and remove the zerobuffers option, fio spends
about 7% of the CPU time creating non-zero write data (to then write
into persistent memory):
  79.04%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   6.98%  fio                 [.] get_io_u
   4.24%  [unknown]           [k] 0x00007f8ee86d34ff
   0.85%  fio                 [.] io_queue_event
   0.82%  fio                 [.] axmap_isset
   0.62%  fio                 [.] io_u_sync_complete
   0.62%  [kernel]            [k] wait_consider_task
   0.58%  fio                 [.] td_io_queue
   0.45%  fio                 [.] thread_main

get_io_u calls small_content_scramble which calls memcpy for 8
bytes at a time.  On my system, gcc chose to inline memcpy
rather than call the glibc library version, and it ends up just
using regular mov instructions:

       │      memcpy():
       │
       │      __fortify_function void *
       │      __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │                     size_t __len))
       │      {
       │        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
 32.56 │        mov    %rax,-0x200(%rcx,%rsi,1)
  5.18 │        mov    0x0(%rbp),%rsi
       │      small_content_scramble():
       │                              get_verify = 1;
  2.49 │        add    $0x200,%rax
  2.81 │        mov    0x8(%rbp),%rdi
       │      memcpy():
 32.91 │        mov    %rsi,-0x10(%rcx)
  3.45 │        mov    %rdi,-0x8(%rcx)
       │      small_content_scramble():
       │                                      td->trim_batch = td->o.trim_backlog;
       │                              get_trim = 1;
       │                      }
       │
       │                      if (get_trim && get_next_trim(td, io_u))
       │                              return true;

small_content_scramble has hardly been touched since 2011, so it probably
hasn't had much performance analysis.  

One of the few changes made was to add an integer divide by 1000, which 
is always slow (painfully slow on some CPUs):

	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;

perf top doesn't show that taking time - I think the compiler realized
it could pull that calculation out of the loop and just do it once.  Different
compilers and compiler options might not realize that.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
@ 2017-11-29  4:44                     ` Rebecca Cran
  2017-11-30  4:04                       ` Jens Axboe
  2017-11-29  6:55                     ` Gavriliuk, Anton (HPS Ukraine)
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Rebecca Cran @ 2017-11-29  4:44 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory)
  Cc: Gavriliuk, Anton (HPS Ukraine), Sitsofe Wheeler,
	fio@vger.kernel.org, Kani, Toshimitsu

I’ve not followed the whole discussion, but I did some tests a couple of months ago replacing the calls to memcpy in the memmap engine with a new fio_memcpy function, rebuilding fio with the -ftree-vectorize flag along with -msse2, -mavx etc. 
I saw an improvement of a few hundred MB/sec on my openSUSE system.

— 
Rebecca 

Sent from my iPhone

> On Nov 28, 2017, at 9:13 PM, Elliott, Robert (Persistent Memory) <elliott@hpe.com> wrote:
> 
> 
>> -----Original Message-----
>> From: Gavriliuk, Anton (HPS Ukraine)
>> Sent: Tuesday, November 28, 2017 9:35 PM
>> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran
>> <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
>> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
>> Subject: RE: fio 3.2
> 
> 
>> 4. The CPU instructions used by fio depend on the glibc library version.
>> As mentioned in an earlier fio thread, that changes a lot. With
>> libc2.24.so, random reads seem to be done with rep movsb.
>> 
>> The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't
>> understand what does it mean - " random reads seem to be done with rep
>> movsb "
> 
> While the test is running, run
>    perf top
> 
> and select one of the busiest functions (hopefully a memcpy function):
> 
>  90.14%  libc-2.24.so         [.] __memmove_avx_unaligned_erms
>   3.85%  [unknown]            [k] 0x00007f7414f5c4ff
>   0.53%  [kernel]             [k] wait_consider_task
>   0.33%  fio                  [.] get_io_u
>   0.29%  fio                  [.] td_io_queue
>   0.26%  fio                  [.] io_u_sync_complete
> 
> Hit enter twice to get to the assembly language code.  This shows you
> the amount of time the CPUs are spending in each instruction (based on
> sampling):
> 
>  0.03 â”‚ 2b:   cmp    __x86_shared_non_temporal_threshold,%rdx
>  0.01 â”‚     â†“ jae    15d
>       â”‚       cmp    %rsi,%rdi
>  0.00 â”‚     â†“ jb     4c
>       â”‚     â†“ je     51
>       â”‚       lea    (%rsi,%rdx,1),%r9
>       â”‚       cmp    %r9,%rdi
>       â”‚     â†“ jb     211
>       â”‚ 4c:   mov    %rdx,%rcx
> 99.88 â”‚       rep    movsb %ds:(%rsi),%es:(%rdi)
>  0.08 â”‚ 51: â† retq
>       â”‚ 52:   cmp    $0x10,%dl
> 
> That means 99.88% of 90.14% of the time is spend on rep movsb, presumably
> reading from persistent memory (and writing to regular memory, but to
> a buffer that's entirely in the CPU caches).
> 
> Some more examples of how to analyze the results:
> 
> If I remove the norandommap option, fio spends about 1% of the CPU time 
> in a function maintaining the list of which LBAs it has read:
>  89.63%  libc-2.24.so                [.] __memmove_avx_unaligned_erms
>   3.28%  [unknown]                   [k] 0x00007f34da4824ff
>   0.87%  fio                         [.] axmap_isset 
>   0.58%  [kernel]                    [k] wait_consider_task
>   0.38%  fio                         [.] get_io_u
>   0.34%  fio                         [.] io_u_sync_complete
> 
> 
> If I switch to random writes and remove the zerobuffers option, fio spends
> about 7% of the CPU time creating non-zero write data (to then write
> into persistent memory):
>  79.04%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
>   6.98%  fio                 [.] get_io_u
>   4.24%  [unknown]           [k] 0x00007f8ee86d34ff
>   0.85%  fio                 [.] io_queue_event
>   0.82%  fio                 [.] axmap_isset
>   0.62%  fio                 [.] io_u_sync_complete
>   0.62%  [kernel]            [k] wait_consider_task
>   0.58%  fio                 [.] td_io_queue
>   0.45%  fio                 [.] thread_main
> 
> get_io_u calls small_content_scramble which calls memcpy for 8
> bytes at a time.  On my system, gcc chose to inline memcpy
> rather than call the glibc library version, and it ends up just
> using regular mov instructions:
> 
>       â”‚      memcpy():
>       â”‚
>       â”‚      __fortify_function void *
>       â”‚      __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
>       â”‚                     size_t __len))
>       â”‚      {
>       â”‚        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
> 32.56 â”‚        mov    %rax,-0x200(%rcx,%rsi,1)
>  5.18 â”‚        mov    0x0(%rbp),%rsi
>       â”‚      small_content_scramble():
>       â”‚                              get_verify = 1;
>  2.49 â”‚        add    $0x200,%rax
>  2.81 â”‚        mov    0x8(%rbp),%rdi
>       â”‚      memcpy():
> 32.91 â”‚        mov    %rsi,-0x10(%rcx)
>  3.45 â”‚        mov    %rdi,-0x8(%rcx)
>       â”‚      small_content_scramble():
>       â”‚                                      td->trim_batch = td->o.trim_backlog;
>       â”‚                              get_trim = 1;
>       â”‚                      }
>       â”‚
>       â”‚                      if (get_trim && get_next_trim(td, io_u))
>       â”‚                              return true;
> 
> small_content_scramble has hardly been touched since 2011, so it probably
> hasn't had much performance analysis.  
> 
> One of the few changes made was to add an integer divide by 1000, which 
> is always slow (painfully slow on some CPUs):
> 
>    offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;
> 
> perf top doesn't show that taking time - I think the compiler realized
> it could pull that calculation out of the loop and just do it once.  Different
> compilers and compiler options might not realize that.
> 
> 
> \x04èº{.nÇ+‰·Ÿ®‰†+%ŠËlzwm…ébë§²æìr¸›yø¨Š{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hš‹àz¹\x1e®w¥¢¸\f¢·¦j:+v‰¨ŠwèjØm¶Ÿÿ¾\a«‘êçzZ+ƒùšŽŠÝ¢j"ú!¶i


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-29  4:44                     ` Rebecca Cran
@ 2017-11-30  4:04                       ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2017-11-30  4:04 UTC (permalink / raw)
  To: Rebecca Cran, Elliott, Robert (Persistent Memory)
  Cc: Gavriliuk, Anton (HPS Ukraine), Sitsofe Wheeler,
	fio@vger.kernel.org, Kani, Toshimitsu

On 11/28/2017 09:44 PM, Rebecca Cran wrote:
> I’ve not followed the whole discussion, but I did some tests a couple
> of months ago replacing the calls to memcpy in the memmap engine with
> a new fio_memcpy function, rebuilding fio with the -ftree-vectorize
> flag along with -msse2, -mavx etc.  I saw an improvement of a few
> hundred MB/sec on my openSUSE system.

Care to send a patch for that? I think it'd be nice for fio to pack it's
own, especially when the performance varies so wildly and is platform
(and even version) dependent.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
  2017-11-29  4:44                     ` Rebecca Cran
@ 2017-11-29  6:55                     ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-29  7:40                     ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-30  4:21                     ` Jens Axboe
  3 siblings, 0 replies; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29  6:55 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Ok.  I will drill down my pmem tests with perf tool.

On my system, gcc chose to inline memcpy
rather than call the glibc library version, and it ends up just
using regular mov instructions

What's your system ?
Which gcc version ?

Anton

-----Original Message-----
From: Elliott, Robert (Persistent Memory) 
Sent: Wednesday, November 29, 2017 6:14 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2


> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Tuesday, November 28, 2017 9:35 PM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca 
> Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
> Subject: RE: fio 3.2


> 4. The CPU instructions used by fio depend on the glibc library version.
> As mentioned in an earlier fio thread, that changes a lot. With 
> libc2.24.so, random reads seem to be done with rep movsb.
> 
> The test box SLES12 SP3 has glibc2.22, so we have to update. But I 
> can't understand what does it mean - " random reads seem to be done 
> with rep movsb "

While the test is running, run
	perf top

and select one of the busiest functions (hopefully a memcpy function):

  90.14%  libc-2.24.so         [.] __memmove_avx_unaligned_erms
   3.85%  [unknown]            [k] 0x00007f7414f5c4ff
   0.53%  [kernel]             [k] wait_consider_task
   0.33%  fio                  [.] get_io_u
   0.29%  fio                  [.] td_io_queue
   0.26%  fio                  [.] io_u_sync_complete

Hit enter twice to get to the assembly language code.  This shows you the amount of time the CPUs are spending in each instruction (based on
sampling):

  0.03 │ 2b:   cmp    __x86_shared_non_temporal_threshold,%rdx
  0.01 │     ↓ jae    15d
       │       cmp    %rsi,%rdi
  0.00 │     ↓ jb     4c
       │     ↓ je     51
       │       lea    (%rsi,%rdx,1),%r9
       │       cmp    %r9,%rdi
       │     ↓ jb     211
       │ 4c:   mov    %rdx,%rcx
 99.88 │       rep    movsb %ds:(%rsi),%es:(%rdi)
  0.08 │ 51: ← retq
       │ 52:   cmp    $0x10,%dl

That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches).

Some more examples of how to analyze the results:

If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read:
  89.63%  libc-2.24.so                [.] __memmove_avx_unaligned_erms
   3.28%  [unknown]                   [k] 0x00007f34da4824ff
   0.87%  fio                         [.] axmap_isset 
   0.58%  [kernel]                    [k] wait_consider_task
   0.38%  fio                         [.] get_io_u
   0.34%  fio                         [.] io_u_sync_complete


If I switch to random writes and remove the zerobuffers option, fio spends
about 7% of the CPU time creating non-zero write data (to then write
into persistent memory):
  79.04%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   6.98%  fio                 [.] get_io_u
   4.24%  [unknown]           [k] 0x00007f8ee86d34ff
   0.85%  fio                 [.] io_queue_event
   0.82%  fio                 [.] axmap_isset
   0.62%  fio                 [.] io_u_sync_complete
   0.62%  [kernel]            [k] wait_consider_task
   0.58%  fio                 [.] td_io_queue
   0.45%  fio                 [.] thread_main

get_io_u calls small_content_scramble which calls memcpy for 8
bytes at a time.  On my system, gcc chose to inline memcpy
rather than call the glibc library version, and it ends up just
using regular mov instructions:

       │      memcpy():
       │
       │      __fortify_function void *
       │      __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │                     size_t __len))
       │      {
       │        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
 32.56 │        mov    %rax,-0x200(%rcx,%rsi,1)
  5.18 │        mov    0x0(%rbp),%rsi
       │      small_content_scramble():
       │                              get_verify = 1;
  2.49 │        add    $0x200,%rax
  2.81 │        mov    0x8(%rbp),%rdi
       │      memcpy():
 32.91 │        mov    %rsi,-0x10(%rcx)
  3.45 │        mov    %rdi,-0x8(%rcx)
       │      small_content_scramble():
       │                                      td->trim_batch = td->o.trim_backlog;
       │                              get_trim = 1;
       │                      }
       │
       │                      if (get_trim && get_next_trim(td, io_u))
       │                              return true;

small_content_scramble has hardly been touched since 2011, so it probably
hasn't had much performance analysis.  

One of the few changes made was to add an integer divide by 1000, which 
is always slow (painfully slow on some CPUs):

	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;

perf top doesn't show that taking time - I think the compiler realized
it could pull that calculation out of the loop and just do it once.  Different
compilers and compiler options might not realize that.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
  2017-11-29  4:44                     ` Rebecca Cran
  2017-11-29  6:55                     ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-29  7:40                     ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-29  8:22                       ` Elliott, Robert (Persistent Memory)
  2017-11-30  4:21                     ` Jens Axboe
  3 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29  7:40 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Previously I used fio 2.12 and it works fine.  Now I build fio 3.2 with gcc 4.8.5 and it returns error,

dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=100g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2
Starting 16 processes
4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0



Run status group 0 (all jobs):
dl560g10spmem01:~ #

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine) 
Sent: Wednesday, November 29, 2017 8:56 AM
To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

Ok.  I will drill down my pmem tests with perf tool.

On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions

What's your system ?
Which gcc version ?

Anton

-----Original Message-----
From: Elliott, Robert (Persistent Memory)
Sent: Wednesday, November 29, 2017 6:14 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2


> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Tuesday, November 28, 2017 9:35 PM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca 
> Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
> Subject: RE: fio 3.2


> 4. The CPU instructions used by fio depend on the glibc library version.
> As mentioned in an earlier fio thread, that changes a lot. With 
> libc2.24.so, random reads seem to be done with rep movsb.
> 
> The test box SLES12 SP3 has glibc2.22, so we have to update. But I 
> can't understand what does it mean - " random reads seem to be done 
> with rep movsb "

While the test is running, run
	perf top

and select one of the busiest functions (hopefully a memcpy function):

  90.14%  libc-2.24.so         [.] __memmove_avx_unaligned_erms
   3.85%  [unknown]            [k] 0x00007f7414f5c4ff
   0.53%  [kernel]             [k] wait_consider_task
   0.33%  fio                  [.] get_io_u
   0.29%  fio                  [.] td_io_queue
   0.26%  fio                  [.] io_u_sync_complete

Hit enter twice to get to the assembly language code.  This shows you the amount of time the CPUs are spending in each instruction (based on
sampling):

  0.03 │ 2b:   cmp    __x86_shared_non_temporal_threshold,%rdx
  0.01 │     ↓ jae    15d
       │       cmp    %rsi,%rdi
  0.00 │     ↓ jb     4c
       │     ↓ je     51
       │       lea    (%rsi,%rdx,1),%r9
       │       cmp    %r9,%rdi
       │     ↓ jb     211
       │ 4c:   mov    %rdx,%rcx
 99.88 │       rep    movsb %ds:(%rsi),%es:(%rdi)
  0.08 │ 51: ← retq
       │ 52:   cmp    $0x10,%dl

That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches).

Some more examples of how to analyze the results:

If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read:
  89.63%  libc-2.24.so                [.] __memmove_avx_unaligned_erms
   3.28%  [unknown]                   [k] 0x00007f34da4824ff
   0.87%  fio                         [.] axmap_isset 
   0.58%  [kernel]                    [k] wait_consider_task
   0.38%  fio                         [.] get_io_u
   0.34%  fio                         [.] io_u_sync_complete


If I switch to random writes and remove the zerobuffers option, fio spends
about 7% of the CPU time creating non-zero write data (to then write
into persistent memory):
  79.04%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   6.98%  fio                 [.] get_io_u
   4.24%  [unknown]           [k] 0x00007f8ee86d34ff
   0.85%  fio                 [.] io_queue_event
   0.82%  fio                 [.] axmap_isset
   0.62%  fio                 [.] io_u_sync_complete
   0.62%  [kernel]            [k] wait_consider_task
   0.58%  fio                 [.] td_io_queue
   0.45%  fio                 [.] thread_main

get_io_u calls small_content_scramble which calls memcpy for 8
bytes at a time.  On my system, gcc chose to inline memcpy
rather than call the glibc library version, and it ends up just
using regular mov instructions:

       │      memcpy():
       │
       │      __fortify_function void *
       │      __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │                     size_t __len))
       │      {
       │        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
 32.56 │        mov    %rax,-0x200(%rcx,%rsi,1)
  5.18 │        mov    0x0(%rbp),%rsi
       │      small_content_scramble():
       │                              get_verify = 1;
  2.49 │        add    $0x200,%rax
  2.81 │        mov    0x8(%rbp),%rdi
       │      memcpy():
 32.91 │        mov    %rsi,-0x10(%rcx)
  3.45 │        mov    %rdi,-0x8(%rcx)
       │      small_content_scramble():
       │                                      td->trim_batch = td->o.trim_backlog;
       │                              get_trim = 1;
       │                      }
       │
       │                      if (get_trim && get_next_trim(td, io_u))
       │                              return true;

small_content_scramble has hardly been touched since 2011, so it probably
hasn't had much performance analysis.  

One of the few changes made was to add an integer divide by 1000, which 
is always slow (painfully slow on some CPUs):

	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;

perf top doesn't show that taking time - I think the compiler realized
it could pull that calculation out of the loop and just do it once.  Different
compilers and compiler options might not realize that.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29  7:40                     ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-29  8:22                       ` Elliott, Robert (Persistent Memory)
  2017-11-29 13:12                         ` Gavriliuk, Anton (HPS Ukraine)
  0 siblings, 1 reply; 39+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-11-29  8:22 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler,
	Tomohiro Kusumi
  Cc: fio@vger.kernel.org, Kani, Toshimitsu



> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Wednesday, November 29, 2017 1:40 AM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca
> Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
> Subject: RE: fio 3.2
> 
> Previously I used fio 2.12 and it works fine.  Now I build fio 3.2
> with gcc 4.8.5 and it returns error,
> 
> dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx --
> ioengine=mmap --iodepth=1 --rw=randrw --
> bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --
> size=100g --numjobs=16 --group_reporting --runtime=120 --
> filename=/dev/dax0.0 --allow_file_create=0
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-
> 256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
> ...
> fio-3.2
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

I see the same thing.

Commit 0f34169a broke it, moving a check from a non-character
device path up a level to where it now triggers on a character
device (for which the size is left set to -1ULL).

This gets it working again:
	git revert 0f34169a
	make

Tomohiro, could you help prepare a better fix?






^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29  8:22                       ` Elliott, Robert (Persistent Memory)
@ 2017-11-29 13:12                         ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-29 17:17                           ` Sitsofe Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-29 13:12 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler, Tomohiro Kusumi
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Latest findings...

With the fio2.12 /dev/dax & mmap().  I run trace before start test.  We can see above open(), mmap()

    5.630816 cpu=65 pid=45154 tgid=45154 open [2] entry *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 filename: /dev/dax0.0
    5.630819 cpu=65 pid=45154 tgid=45154 open [2] ret=3 syscallbeg=    0.000003 *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600
    5.630833 cpu=65 pid=45154 tgid=45154 mmap [9] entry *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0
    5.630836 cpu=65 pid=45154 tgid=45154 mmap [9] ret=0x7f1a00200000 syscallbeg=    0.000003 *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0
    5.630836 cpu=65 pid=45154 tgid=45154 madvise [28] entry *addr=0x7f1a00200000 length=-2147483648 advise=0x1
    5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0 syscallbeg=    0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1

How it could be that mmap() length size -2Gb ??? (length=-2147483648).  I pointed +290g,

dl560g10spmem01:/var/work/fio-3.2 # /usr/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g

-----Original Message-----
From: Elliott, Robert (Persistent Memory) 
Sent: Wednesday, November 29, 2017 10:23 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>; Tomohiro Kusumi <tkusumi@tuxera.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2



> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Wednesday, November 29, 2017 1:40 AM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Rebecca 
> Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
> Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
> Subject: RE: fio 3.2
> 
> Previously I used fio 2.12 and it works fine.  Now I build fio 3.2 
> with gcc 4.8.5 and it returns error,
> 
> dl560g10spmem01:~ # /usr/local/bin/fio --name=4-rand-rw-3xx -- 
> ioengine=mmap --iodepth=1 --rw=randrw --
> bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 -- 
> size=100g --numjobs=16 --group_reporting --runtime=120 --
> filename=/dev/dax0.0 --allow_file_create=0
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B- 
> 256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-3.2
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

I see the same thing.

Commit 0f34169a broke it, moving a check from a non-character device path up a level to where it now triggers on a character device (for which the size is left set to -1ULL).

This gets it working again:
	git revert 0f34169a
	make

Tomohiro, could you help prepare a better fix?






^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-29 13:12                         ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-29 17:17                           ` Sitsofe Wheeler
  2017-11-29 20:24                             ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-11-29 17:17 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine)
  Cc: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Tomohiro Kusumi, fio@vger.kernel.org, Kani, Toshimitsu

Unsigned number being printed as signed?

On 29 November 2017 at 13:12, Gavriliuk, Anton (HPS Ukraine)
<anton.gavriliuk@hpe.com> wrote:
> Latest findings...
>
> With the fio2.12 /dev/dax & mmap().  I run trace before start test.  We can see above open(), mmap()
>
>     5.630816 cpu=65 pid=45154 tgid=45154 open [2] entry *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600 filename: /dev/dax0.0
>     5.630819 cpu=65 pid=45154 tgid=45154 open [2] ret=3 syscallbeg=    0.000003 *pathname=0x7f62b6fabed0 flags=NOATIME mode=0600
>     5.630833 cpu=65 pid=45154 tgid=45154 mmap [9] entry *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0
>     5.630836 cpu=65 pid=45154 tgid=45154 mmap [9] ret=0x7f1a00200000 syscallbeg=    0.000003 *addr=0x0 length=-2147483648 prot=READ|WRITE flags=SHARED fd=3 offset=0x0
>     5.630836 cpu=65 pid=45154 tgid=45154 madvise [28] entry *addr=0x7f1a00200000 length=-2147483648 advise=0x1
>     5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0 syscallbeg=    0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1
>
> How it could be that mmap() length size -2Gb ??? (length=-2147483648).  I pointed +290g,
>
> dl560g10spmem01:/var/work/fio-3.2 # /usr/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-29 17:17                           ` Sitsofe Wheeler
@ 2017-11-29 20:24                             ` Elliott, Robert (Persistent Memory)
  0 siblings, 0 replies; 39+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-11-29 20:24 UTC (permalink / raw)
  To: Sitsofe Wheeler, Gavriliuk, Anton (HPS Ukraine)
  Cc: Rebecca Cran, Tomohiro Kusumi, fio@vger.kernel.org,
	Kani, Toshimitsu



> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Sitsofe Wheeler
...
> On 29 November 2017 at 13:12, Gavriliuk, Anton (HPS Ukraine)
> <anton.gavriliuk@hpe.com> wrote:
> > Latest findings...
> >
> > With the fio2.12 /dev/dax & mmap().  I run trace before start test.  We
> can see above open(), mmap()
> >
...
> >     5.630837 cpu=65 pid=45154 tgid=45154 madvise [28] ret=0x0
> syscallbeg=    0.000001 *addr=0x7f1a00200000 length=-2147483648 advise=0x1
> >
> > How it could be that mmap() length size -2Gb ??? (length=-2147483648).
> > I pointed +290g,
>
> Unsigned number being printed as signed?

Yes, that's 0x80000000 in hex.

The size_t argument is indeed unsigned (ssize_t is signed):
       int madvise(void *addr, size_t length, int advice);

This print is apparently from:
	https://github.com/HewlettPackard/LinuxKI

in src/kiinfo/syscalls.c print_syscall_args():	
	switch (format) {
		case SKIP:    break;
		case DECIMAL: printf ("%c%s=%i", fsep, label,  (int32)argval); break;
		case HEX:     printf ("%c%s=0x%llx", fsep, label,  argval); break;
		case OCTAL:   printf ("%c%s=0%llo", fsep, label, argval); break;

printed as a signed 32-bit value because of src/kiinfo/globals.c:
	{"madvise", "ret", HEX, "*addr", HEX, "length", DECIMAL, "advise", HEX, NULL, SKIP, NULL, SKIP, NULL, SKIP},
	{"mmap", "ret", HEX, "*addr", HEX, "length", DECIMAL, "prot", MMAP_PROT, "flags", MMAP_FLAGS, "fd", DECIMAL, "offset", HEX},

That project should differentiate between signed and unsigned arguments,
and add support for 64-bit signed and unsigned arguments as well.





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
                                       ` (2 preceding siblings ...)
  2017-11-29  7:40                     ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-30  4:21                     ` Jens Axboe
  2017-11-30 14:17                       ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-30 16:13                       ` Jens Axboe
  3 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2017-11-30  4:21 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory),
	Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote:
> small_content_scramble has hardly been touched since 2011, so it probably
> hasn't had much performance analysis.  

That's fair, would be a good thing to look at, especially since it's on
by default.

> One of the few changes made was to add an integer divide by 1000, which 
> is always slow (painfully slow on some CPUs):
> 
> 	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;
> 
> perf top doesn't show that taking time - I think the compiler realized
> it could pull that calculation out of the loop and just do it once.  Different
> compilers and compiler options might not realize that.

Yeah, int divides suck. This didn't used to be like this, it got added
with the nsec conversion. I have fixed this up now.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-30  4:21                     ` Jens Axboe
@ 2017-11-30 14:17                       ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-30 21:23                         ` Jens Axboe
  2017-11-30 16:13                       ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-30 14:17 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

It's there any chance to fix it ?

dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-2.99
Starting 16 processes
4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

4-rand-rw-3xx: failed to get file size of /dev/dax0.0

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Thursday, November 30, 2017 6:22 AM
To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>; Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: Re: fio 3.2

On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote:
> small_content_scramble has hardly been touched since 2011, so it 
> probably hasn't had much performance analysis.

That's fair, would be a good thing to look at, especially since it's on by default.

> One of the few changes made was to add an integer divide by 1000, 
> which is always slow (painfully slow on some CPUs):
> 
> 	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;
> 
> perf top doesn't show that taking time - I think the compiler realized 
> it could pull that calculation out of the loop and just do it once.  
> Different compilers and compiler options might not realize that.

Yeah, int divides suck. This didn't used to be like this, it got added with the nsec conversion. I have fixed this up now.

--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-30 14:17                       ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-30 21:23                         ` Jens Axboe
  2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
                                             ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Jens Axboe @ 2017-11-30 21:23 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine),
	Elliott, Robert (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
> ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-30 21:23                         ` Jens Axboe
@ 2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-01  7:15                             ` Robert Elliott (Persistent Memory)
  2017-12-03  3:24                           ` Gavriliuk, Anton (HPS Ukraine)
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-01  5:21 UTC (permalink / raw)
  To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Ok, thanks Jens and Elliott

I fixed that.

Now with /dev/dax0.0 and mmap ioengine slat is gone.  But if fio maps entire /dev/dax0.0 device once and the using CPU load/store, why clat number's so high, 6us for read, 12us for write ??

dl560g10spmem01:/var/work/fio # taskset -c 0-19 /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2-53-ga7d0-dirty
Starting 16 processes
Jobs: 1 (f=1): [_(5),m(1),_(10)][100.0%][r=1142MiB/s,w=21.2GiB/s][r=22.4k,w=428k IOPS][eta 00m:00s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=67241: Thu Nov 30 23:04:51 2017
   read: IOPS=37.0k, BW=1931MiB/s (2025MB/s)(232GiB/123040msec)
    clat (nsec): min=197, max=753983, avg=6709.21, stdev=4157.00
     lat (nsec): min=218, max=754005, avg=6740.89, stdev=4157.15
    clat percentiles (nsec):
     |  1.00th=[  572],  5.00th=[  996], 10.00th=[ 1528], 20.00th=[ 3568],
     | 30.00th=[ 4832], 40.00th=[ 6880], 50.00th=[ 7328], 60.00th=[ 7712],
     | 70.00th=[ 8096], 80.00th=[ 8512], 90.00th=[ 9536], 95.00th=[10688],
     | 99.00th=[25984], 99.50th=[31360], 99.90th=[40192], 99.95th=[45824],
     | 99.99th=[54016]
   bw (  KiB/s): min=110400, max=148800, per=6.28%, avg=124171.80, stdev=4366.58, samples=3910
   iops        : min= 2114, max= 2842, avg=2384.24, stdev=74.15, samples=3910
  write: IOPS=721k, BW=35.8GiB/s (38.5GB/s)(4408GiB/123040msec)
    clat (nsec): min=72, max=814061, avg=12174.57, stdev=7463.58
     lat (nsec): min=91, max=814132, avg=12221.87, stdev=7475.72
    clat percentiles (nsec):
     |  1.00th=[  684],  5.00th=[ 1400], 10.00th=[ 2064], 20.00th=[ 6304],
     | 30.00th=[ 8384], 40.00th=[12992], 50.00th=[13760], 60.00th=[14400],
     | 70.00th=[15040], 80.00th=[15936], 90.00th=[17024], 95.00th=[18048],
     | 99.00th=[46336], 99.50th=[59648], 99.90th=[63232], 99.95th=[64256],
     | 99.99th=[71168]
   bw (  MiB/s): min= 2197, max= 2647, per=6.28%, avg=2303.69, stdev=20.83, samples=3910
   iops        : min=43091, max=52348, avg=45290.89, stdev=447.39, samples=3910
  lat (nsec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=1.99%, 1000=1.53%
  lat (usec)   : 2=6.41%, 4=6.96%, 10=18.78%, 20=61.82%, 50=1.56%
  lat (usec)   : 100=0.95%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=99.80%, sys=0.18%, ctx=4735, majf=0, minf=2519628
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=4671537,88741744,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1931MiB/s (2025MB/s), 1931MiB/s-1931MiB/s (2025MB/s-2025MB/s), io=232GiB (249GB), run=123040-123040msec
  WRITE: bw=35.8GiB/s (38.5GB/s), 35.8GiB/s-35.8GiB/s (38.5GB/s-38.5GB/s), io=4408GiB (4733GB), run=123040-123040msec

Fio numa placement doesn't work, when I remove taskset and add numa_cpu_nodes

dl560g10spmem01:/var/work/fio #
dl560g10spmem01:/var/work/fio # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=300g --numa_cpu_nodes=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2-53-ga7d0-dirty
Starting 16 processes
fio: pid=67474, got signal=7
fio: pid=67481, got signal=7
fio: pid=67479, got signal=7
fio: pid=67482, got signal=7
fio: pid=67483, got signal=7
fio: pid=67480, got signal=7
fio: pid=67486, got signal=7
fio: pid=67485, got signal=7
fio: pid=67475, got signal=74),K(5),m(1),K(2),m(1)][0.4%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 29m:52s]
fio: pid=67478, got signal=7
fio: pid=67472, got signal=7
fio: pid=67484, got signal=7
fio: pid=67476, got signal=72),m(2),K(9),m(1)][0.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 30m:22s]
fio: pid=67487, got signal=7
fio: pid=67473, got signal=7
fio: pid=67477, got signal=7

4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=67472: Thu Nov 30 23:13:07 2017
  lat (usec)   : 4=12.80%, 10=22.30%, 20=62.03%, 50=2.87%
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=24,429,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
fio: file hash not empty on exit
dl560g10spmem01:/var/work/fio #



-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Thursday, November 30, 2017 11:24 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: Re: fio 3.2

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 
> --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap 
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 
> --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-01  7:15                             ` Robert Elliott (Persistent Memory)
  2017-12-03  9:35                               ` Sitsofe Wheeler
  0 siblings, 1 reply; 39+ messages in thread
From: Robert Elliott (Persistent Memory) @ 2017-12-01  7:15 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

> Fio numa placement doesn't work, when I remove taskset and add
> numa_cpu_nodes
>

> dl560g10spmem01:/var/work/fio # taskset -c 0-19 /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g

> dl560g10spmem01:/var/work/fio # /usr/local/bin/fio --
> filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --
> randrepeat=0 --ioengine=mmap --
> bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --
> iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-
> rand-rw-3xx --size=300g --numa_cpu_nodes=0
> Starting 16 processes
> fio: pid=67474, got signal=7

signal=7 occurs if the size is too big; you increased the size by 10 GiB and
probably hit the limit (not accounting for the ndctl metadata overhead).
It's probably a 300 GiB capacity 

While discussing NUMA, I'll mention something else I saw in Windows 
while fixing the thread affinities there.

At startup, fio spawns threads on all CPUs to measure the clocks
(fio_monotonic_clocktest).  If you've constrained the CPU affinity
outside fio, some of those will fail.  In Windows, something like
START /AFFINITY 0x55555555 fio ...
can cause half of the clock threads to fail.

My Windows machine stopped working around then, so I haven't gotten
around to trying a fix for that yet.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-12-01  7:15                             ` Robert Elliott (Persistent Memory)
@ 2017-12-03  9:35                               ` Sitsofe Wheeler
  2017-12-03 17:10                                 ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-12-03  9:35 UTC (permalink / raw)
  To: Robert Elliott (Persistent Memory)
  Cc: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran,
	fio@vger.kernel.org, Kani, Toshimitsu

On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory)
<elliott@hpe.com> wrote:
> While discussing NUMA, I'll mention something else I saw in Windows
> while fixing the thread affinities there.
>
> At startup, fio spawns threads on all CPUs to measure the clocks
> (fio_monotonic_clocktest).  If you've constrained the CPU affinity
> outside fio, some of those will fail.  In Windows, something like
> START /AFFINITY 0x55555555 fio ...
> can cause half of the clock threads to fail.

This is very weird and doesn't make any sense (but I believe you): if
you have multiple threads crammed on to the same CPUs the TSC no
longer looks like it monotonically increases? Surely it should be MORE
likely to increase because a thread is likely to be on the same CPU as
another and can't actually be running at the same time as the other?

Something suspicious is going on - to this day I'm still unsure as to
why https://github.com/sitsofe/fio/commit/1a7a1fd98ad749768212f49f5ba3e9a1392c4263
would sometimes fail on Windows (but seemingly not on Linux)...

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-12-03  9:35                               ` Sitsofe Wheeler
@ 2017-12-03 17:10                                 ` Jens Axboe
  2017-12-05 20:29                                   ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2017-12-03 17:10 UTC (permalink / raw)
  To: Sitsofe Wheeler, Robert Elliott (Persistent Memory)
  Cc: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, fio@vger.kernel.org,
	Kani, Toshimitsu

On 12/03/2017 02:35 AM, Sitsofe Wheeler wrote:
> On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory)
> <elliott@hpe.com> wrote:
>> While discussing NUMA, I'll mention something else I saw in Windows
>> while fixing the thread affinities there.
>>
>> At startup, fio spawns threads on all CPUs to measure the clocks
>> (fio_monotonic_clocktest).  If you've constrained the CPU affinity
>> outside fio, some of those will fail.  In Windows, something like
>> START /AFFINITY 0x55555555 fio ...
>> can cause half of the clock threads to fail.
> 
> This is very weird and doesn't make any sense (but I believe you): if
> you have multiple threads crammed on to the same CPUs the TSC no
> longer looks like it monotonically increases? Surely it should be MORE
> likely to increase because a thread is likely to be on the same CPU as
> another and can't actually be running at the same time as the other?

The threads fail to start, it's not a TSC failure. I'm guessing it's
because fio gets limited to a subset of the available CPUs, and that
causes fio to fail doing the clock check when fio_setaffinity() in
clock_thread_fn() fails.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-03 17:10                                 ` Jens Axboe
@ 2017-12-05 20:29                                   ` Elliott, Robert (Persistent Memory)
  0 siblings, 0 replies; 39+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-12-05 20:29 UTC (permalink / raw)
  To: Jens Axboe, Sitsofe Wheeler
  Cc: Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, fio@vger.kernel.org,
	Kani

> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Jens Axboe
> Sent: Sunday, December 3, 2017 11:11 AM
> To: Sitsofe Wheeler <sitsofe@gmail.com>; Robert Elliott (Persistent
> Memory) <elliott@hpe.com>
> Cc: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Rebecca Cran
> <rebecca@bluestop.org>; fio@vger.kernel.org; Kani, Toshimitsu
> <toshi.kani@hpe.com>
> Subject: Re: fio 3.2
> 
> On 12/03/2017 02:35 AM, Sitsofe Wheeler wrote:
> > On 1 December 2017 at 07:15, Robert Elliott (Persistent Memory)
> > <elliott@hpe.com> wrote:
> >> While discussing NUMA, I'll mention something else I saw in Windows
> >> while fixing the thread affinities there.
> >>
> >> At startup, fio spawns threads on all CPUs to measure the clocks
> >> (fio_monotonic_clocktest).  If you've constrained the CPU affinity
> >> outside fio, some of those will fail.  In Windows, something like
> >> START /AFFINITY 0x55555555 fio ...
> >> can cause half of the clock threads to fail.
> >
> > This is very weird and doesn't make any sense (but I believe you): if
> > you have multiple threads crammed on to the same CPUs the TSC no
> > longer looks like it monotonically increases? Surely it should be MORE
> > likely to increase because a thread is likely to be on the same CPU as
> > another and can't actually be running at the same time as the other?
> 
> The threads fail to start, it's not a TSC failure. I'm guessing it's
> because fio gets limited to a subset of the available CPUs, and that
> causes fio to fail doing the clock check when fio_setaffinity() in
> clock_thread_fn() fails.

I'll work on a patch to limit all threads (including clocktest
threads at startup and I/O threads) to the parent's affinity
mask in Windows.

In linux, processes and threads are not restricted to their parent's
affinity mask.  Although they inherit by default, they may call
sched_setaffinity with any values. The taskset and numactl manpages
don't mention that programs may override the settings requested
by the user.

Example:
   $ numactl -a -C 5 fio --debug=process -dax_mmap.fio
including:
   cpus_allowed_policy=split
   cpus_allowed=0-17,36-53
   thread
fio is allowed to place threads on all the CPUs:
   ROB: process 20358 cpu=5 thread affinity_count=1
   ROB: process 20353 cpu=0 thread affinity_count=1
   ROB: process 20355 cpu=2 thread affinity_count=1
   ROB: process 20354 cpu=1 thread affinity_count=1
   ROB: process 20356 cpu=3 thread affinity_count=1
   ...

Should fio apply the same filtering in linux?

If not, should it print an info or warning message when 
creating threads that violate the parent mask?

---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-30 21:23                         ` Jens Axboe
  2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-03  3:24                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 17:30                           ` Gavriliuk, Anton (HPS Ukraine)
  3 siblings, 0 replies; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-03  3:24 UTC (permalink / raw)
  To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Ok, thanks.  Now I built latest fio available :-)

dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2-64-gaf7fb
Starting 16 processes
Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s]
       
       From perf top,
       
Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613
Overhead  Shared Object       Symbol                                                                                                                        ◆
  93.34%  libc-2.22.so        [.] __memcpy_avx_unaligned      

Just can't drill it down,

__memcpy_avx_unaligned  /lib64/libc-2.22.so
Huh? No selection. Report to linux-kernel@vger.kernel.org 

While other functions can,

get_io_u  /usr/local/bin/fio
  0.08 │        xor    %eax,%esi
  0.01 │        shl    $0x6,%rsi
  0.02 │        and    $0x1c0,%esi
  0.00 │        sub    %rax,%rsi
  0.12 │        add    %rdx,%rsi
  0.00 │        sub    %rax,%rdx
       │        nop
       │                      /*
       │                       * Fill offset into start of cacheline, time into end
       │                       * of cacheline
       │                       */
       │                      iptr = (void *) p + (offset << 6);
       │                      *iptr = boffset;
 16.01 │ 230:   mov    %rax,(%rsi,%rax,1)
       │
       │                      iptr = (void *) p + 64 - 2 * sizeof(uint64_t);
       │                      iptr[0] = io_u->start_time.tv_sec;
  7.03 │        mov    0x0(%rbp),%rcx
 13.52 │        mov    %rcx,0x30(%rdx,%rax,1)                   

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Thursday, November 30, 2017 11:24 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: Re: fio 3.2

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 
> --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap 
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 
> --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-30 21:23                         ` Jens Axboe
  2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-03  3:24                           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 17:30                             ` Robert Elliott (Persistent Memory)
  2017-12-04 18:41                             ` Jens Axboe
  2017-12-04 17:30                           ` Gavriliuk, Anton (HPS Ukraine)
  3 siblings, 2 replies; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:12 UTC (permalink / raw)
  To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

I have run next command several times,

perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0

each perf report shows me toppest clock_thread_fn(),

Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235
Overhead  Command  Shared Object       Symbol
  75.27%  fio      fio                 [.] clock_thread_fn
  13.68%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
  10.05%  fio      fio                 [.] fill_random_buf

Samples: 8M of event 'cycles:ppp', Event count (approx.): 28889015047191
Overhead  Command  Shared Object       Symbol
  79.97%  fio      fio                 [.] clock_thread_fn
  11.08%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   8.14%  fio      fio                 [.] fill_random_buf
   0.13%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 57473976919058
Overhead  Command  Shared Object       Symbol
  89.94%  fio      fio                 [.] clock_thread_fn
   5.57%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   4.09%  fio      fio                 [.] fill_random_buf
   0.07%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 160828042961056
Overhead  Command  Shared Object       Symbol                                                                                                               ◆
  92.30%  fio      fio                 [.] clock_thread_fn                                                                                                  ▒
   4.10%  fio      [kernel.kallsyms]   [k] filemap_map_pages                                                                                                ▒
   1.99%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned                                                                                           ▒
   1.46%  fio      fio                 [.] fill_random_buf                                                                                                  ▒
   0.02%  fio      fio                 [.] get_io_u                                                                                                         ▒
   0.02%  fio      fio                 [.] io_u_sync_complete      

Samples: 8M of event 'cycles:ppp', Event count (approx.): 163028870650810
Overhead  Command  Shared Object       Symbol
  87.68%  fio      fio                 [.] clock_thread_fn
   8.77%  fio      [kernel.kallsyms]   [k] run_timer_softirq
   1.96%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   1.44%  fio      fio                 [.] fill_random_buf
   0.02%  fio      fio                 [.] get_io_u
   0.02%  fio      fio                 [.] io_u_sync_complete

 89.62 │       lock   cmpxchg %edi,(%rsi)                                                                                                                   ▒
       │     clock_thread_fn():                                                                                                                             ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │                             if (seq == UINT_MAX)                                                                                                   ▒
       │                                     break;                                                                                                         ▒
       │                             __sync_synchronize();                                                                                                  ▒
       │                             tsc = get_cpu_clock();                                                                                                 ▒
       │                     } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));                                                              ▒
       │       cmp    %eax,%ecx                                                                                                                             ▒
       │     ↓ je     118                                                                                                                                   ◆
       │                     uint32_t seq;                                                                                                                  ▒
       │                     uint64_t tsc;                                                                                                                  ▒
       │                                                                                                                                                    ▒
       │                     c->cpu = t->cpu;                                                                                                               ▒
       │                     do {                                                                                                                           ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │ c9:   mov    0x68(%rbx),%rcx                                                                                                                       ▒
 10.01 │       mov    (%rcx),%ecx        

May you please clarify, is it ok ?, what is the lock contention...

Anton                                                                                                      				  ▒                                         

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine) 
Sent: Sunday, December 03, 2017 5:24 AM
To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

Ok, thanks.  Now I built latest fio available :-)

dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
fio-3.2-64-gaf7fb
Starting 16 processes
Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s]
       
       From perf top,
       
Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613
Overhead  Shared Object       Symbol                                                                                                                        ◆
  93.34%  libc-2.22.so        [.] __memcpy_avx_unaligned      

Just can't drill it down,

__memcpy_avx_unaligned  /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org 

While other functions can,

get_io_u  /usr/local/bin/fio
  0.08 │        xor    %eax,%esi
  0.01 │        shl    $0x6,%rsi
  0.02 │        and    $0x1c0,%esi
  0.00 │        sub    %rax,%rsi
  0.12 │        add    %rdx,%rsi
  0.00 │        sub    %rax,%rdx
       │        nop
       │                      /*
       │                       * Fill offset into start of cacheline, time into end
       │                       * of cacheline
       │                       */
       │                      iptr = (void *) p + (offset << 6);
       │                      *iptr = boffset;
 16.01 │ 230:   mov    %rax,(%rsi,%rax,1)
       │
       │                      iptr = (void *) p + 64 - 2 * sizeof(uint64_t);
       │                      iptr[0] = io_u->start_time.tv_sec;
  7.03 │        mov    0x0(%rbp),%rcx
 13.52 │        mov    %rcx,0x30(%rdx,%rax,1)                   

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk]
Sent: Thursday, November 30, 2017 11:24 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: Re: fio 3.2

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 
> --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5
> --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-04 17:30                             ` Robert Elliott (Persistent Memory)
  2017-12-04 18:25                               ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 18:41                             ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Robert Elliott (Persistent Memory) @ 2017-12-04 17:30 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine), Jens Axboe, Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu



> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, December 4, 2017 11:12 AM
...
> I have run next command several times,
> 
> perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --
> refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --
> bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --
> iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-
> 3xx --size=290g --numa_cpu_nodes=0
> 
> each perf report shows me toppest clock_thread_fn(),
> 
> Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235
> Overhead  Command  Shared Object       Symbol
>   75.27%  fio      fio                 [.] clock_thread_fn
>   13.68%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
>   10.05%  fio      fio                 [.] fill_random_buf

This just runs one pass through memory, so the startup overhead is magnified.

Add --time_based=1 so it really runs for 1800 s.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-04 17:30                             ` Robert Elliott (Persistent Memory)
@ 2017-12-04 18:25                               ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 18:50                                 ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 18:25 UTC (permalink / raw)
  To: Robert Elliott (Persistent Memory), Jens Axboe, Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Ok, now test for 10 minutes with the same I/O latencies .  So there is no startup overhead.

dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2-64-gaf7fb
Starting 16 processes

   read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(1093GiB/600002msec)
    clat (nsec): min=120, max=16029k, avg=6897.44, stdev=5503.68
     lat (nsec): min=144, max=16029k, avg=6932.35, stdev=5504.08
  write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(20.3TiB/600002msec)
    clat (nsec): min=111, max=72027k, avg=12546.82, stdev=9845.85
     lat (nsec): min=132, max=72027k, avg=12592.74, stdev=9851.98

Samples: 38M of event 'cycles:ppp', Event count (approx.): 182321387032131
Overhead  Command  Shared Object       Symbol
  77.20%  fio      fio                 [.] clock_thread_fn
   8.32%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   6.07%  fio      fio                 [.] fill_random_buf
   4.82%  fio      [kernel.kallsyms]   [k] page_remove_rmap
   3.02%  fio      [kernel.kallsyms]   [k] native_write_msr_safe
   0.10%  fio      fio                 [.] get_io_u

88.99 │       lock   cmpxchg %edi,(%rsi)
       │     clock_thread_fn():
       │                             seq = *t->seq;
       │                             if (seq == UINT_MAX)
       │                                     break;
       │                             __sync_synchronize();
       │                             tsc = get_cpu_clock();
       │                     } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));
       │       cmp    %eax,%ecx
       │     ↓ je     118
       │                     uint32_t seq;
       │                     uint64_t tsc;
       │
       │                     c->cpu = t->cpu;
       │                     do {
       │                             seq = *t->seq;
  0.01 │ c9:   mov    0x68(%rbx),%rcx
 10.57 │       mov    (%rcx),%ecx
       │                             if (seq == UINT_MAX)
       │       cmp    $0xffffffff,%ecx

The same for sequential write it's much faster, write latency dropped 12us -> 5us, results in bw.  But still blocked on clock_thread_fn(),

dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=write --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1
4-rand-rw-3xx: (g=0): rw=write, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
...
fio-3.2-64-gaf7fb
Starting 16 processes

  write: IOPS=949k, BW=47.2GiB/s (50.6GB/s)(27.6TiB/600002msec)
    clat (nsec): min=214, max=36102k, avg=7406.20, stdev=5435.29
     lat (nsec): min=233, max=36102k, avg=7442.12, stdev=5439.69

Samples: 38M of event 'cycles:ppp', Event count (approx.): 240873167676034
Overhead  Command  Shared Object        Symbol
  85.36%  fio      fio                  [.] clock_thread_fn
   6.26%  fio      fio                  [.] fill_random_buf
   5.07%  fio      libc-2.22.so         [.] __memcpy_avx_unaligned
   2.74%  fio      [kernel.kallsyms]    [k] radix_tree_next_chunk

91.46 │       lock   cmpxchg %edi,(%rsi)
       │     clock_thread_fn():
       │                             seq = *t->seq;
       │                             if (seq == UINT_MAX)
       │                                     break;
       │                             __sync_synchronize();
       │                             tsc = get_cpu_clock();
       │                     } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));
       │       cmp    %eax,%ecx
       │     ↓ je     118
       │                     uint32_t seq;
       │                     uint64_t tsc;
       │
       │                     c->cpu = t->cpu;
       │                     do {
       │                             seq = *t->seq;
  0.01 │ c9:   mov    0x68(%rbx),%rcx
  8.10 │       mov    (%rcx),%ecx

Run again, the same results.

Is it possible to break lat (nsec) down ?

Anton

-----Original Message-----
From: Robert Elliott (Persistent Memory) 
Sent: Monday, December 04, 2017 7:31 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Jens Axboe <axboe@kernel.dk>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2



> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, December 4, 2017 11:12 AM
...
> I have run next command several times,
> 
> perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw -- 
> refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --
> bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --
> iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw- 3xx --size=290g --numa_cpu_nodes=0
> 
> each perf report shows me toppest clock_thread_fn(),
> 
> Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235
> Overhead  Command  Shared Object       Symbol
>   75.27%  fio      fio                 [.] clock_thread_fn
>   13.68%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
>   10.05%  fio      fio                 [.] fill_random_buf

This just runs one pass through memory, so the startup overhead is magnified.

Add --time_based=1 so it really runs for 1800 s.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-12-04 18:25                               ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-04 18:50                                 ` Jens Axboe
  2017-12-04 19:17                                   ` Jeff Furlong
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2017-12-04 18:50 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine),
	Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 12/04/2017 11:25 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> Ok, now test for 10 minutes with the same I/O latencies .  So there is no startup overhead.
> 
> dl560g10spmem01:~ # perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=600 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0 --time_based=1
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1
> ...
> fio-3.2-64-gaf7fb
> Starting 16 processes
> 
>    read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(1093GiB/600002msec)
>     clat (nsec): min=120, max=16029k, avg=6897.44, stdev=5503.68
>      lat (nsec): min=144, max=16029k, avg=6932.35, stdev=5504.08
>   write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(20.3TiB/600002msec)
>     clat (nsec): min=111, max=72027k, avg=12546.82, stdev=9845.85
>      lat (nsec): min=132, max=72027k, avg=12592.74, stdev=9851.98
> 
> Samples: 38M of event 'cycles:ppp', Event count (approx.): 182321387032131
> Overhead  Command  Shared Object       Symbol
>   77.20%  fio      fio                 [.] clock_thread_fn
>    8.32%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
>    6.07%  fio      fio                 [.] fill_random_buf
>    4.82%  fio      [kernel.kallsyms]   [k] page_remove_rmap
>    3.02%  fio      [kernel.kallsyms]   [k] native_write_msr_safe
>    0.10%  fio      fio                 [.] get_io_u

That seems so odd, it makes no sense. --no-children added to perf report?
I think this is skewed because the code is so hot for startup.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-04 18:50                                 ` Jens Axboe
@ 2017-12-04 19:17                                   ` Jeff Furlong
  2017-12-04 19:29                                     ` Gavriliuk, Anton (HPS Ukraine)
  0 siblings, 1 reply; 39+ messages in thread
From: Jeff Furlong @ 2017-12-04 19:17 UTC (permalink / raw)
  To: Jens Axboe, Gavriliuk, Anton (HPS Ukraine),
	Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

>That seems so odd, it makes no sense. --no-children added to perf report?
>I think this is skewed because the code is so hot for startup.

Perhaps you want to add --ramp_time=5s to remove startup jitter from host/device?

Regards,
Jeff


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-12-04 19:17                                   ` Jeff Furlong
@ 2017-12-04 19:29                                     ` Gavriliuk, Anton (HPS Ukraine)
  0 siblings, 0 replies; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 19:29 UTC (permalink / raw)
  To: Jeff Furlong, Jens Axboe,
	Robert Elliott  <elliott@hpe.com>, Rebecca Cran (Persistent Memory),
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

Yes, it makes sence.  I also run perf top 15-20 secs later after fio.

Samples: 138K of event 'cycles:ppp', Event count (approx.): 95410433826
Overhead  Shared Object     Symbol
  93.41%  libc-2.22.so      [.] __memcpy_avx_unaligned
   2.36%  fio               [.] get_io_u
   0.65%  fio               [.] ramp_time_over
   0.63%  fio               [.] io_u_sync_complete
   0.34%  fio               [.] td_io_queue
   0.33%  [unknown]         [k] 0x00007f53f8a82b06
   0.33%  fio               [.] __fio_gettime

__memcpy_avx_unaligned  /lib64/libc-2.22.so
Huh? No selection. Report to linux-kernel@vger.kernel.org

It's a tremendous results.  But I'm still thinking is there any chance for further I/O perf boost.

Anton

-----Original Message-----
From: Jeff Furlong [mailto:jeff.furlong@wdc.com] 
Sent: Monday, December 04, 2017 9:18 PM
To: Jens Axboe <axboe@kernel.dk>; Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

>That seems so odd, it makes no sense. --no-children added to perf report?
>I think this is skewed because the code is so hot for startup.

Perhaps you want to add --ramp_time=5s to remove startup jitter from host/device?

Regards,
Jeff


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
  2017-12-04 17:30                             ` Robert Elliott (Persistent Memory)
@ 2017-12-04 18:41                             ` Jens Axboe
  1 sibling, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2017-12-04 18:41 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine),
	Robert Elliott (Persistent Memory), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 12/04/2017 10:12 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> I have run next command several times,
> 
> perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0
> 
> each perf report shows me toppest clock_thread_fn(),

It's the cpu clock calibration that fio does. Just ignore it, it's run
before any of the actual work starts.

Additionally, please stop top posting. Proper net etiquette is replying
below the context you are responding to.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-30 21:23                         ` Jens Axboe
                                             ` (2 preceding siblings ...)
  2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-12-04 17:30                           ` Gavriliuk, Anton (HPS Ukraine)
  3 siblings, 0 replies; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-12-04 17:30 UTC (permalink / raw)
  To: Jens Axboe, Robert Elliott (Persistent Memory), Rebecca Cran,
	Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

It's important, because fio shows 6.9us clat for reads and 13us for writes, long time..., access to the PMEM just a few hundred ns.  So I need to understand where we spend time.

   read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(232GiB/127291msec)
    clat (nsec): min=163, max=16033k, avg=6827.63, stdev=9340.91
     lat (nsec): min=188, max=16033k, avg=6860.92, stdev=9341.21
  write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(4408GiB/127291msec)
    clat (nsec): min=82, max=48028k, avg=12479.56, stdev=13459.80
     lat (nsec): min=106, max=48028k, avg=12524.94, stdev=13464.29

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine) 
Sent: Monday, December 04, 2017 7:12 PM
To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; 'Rebecca Cran' <rebecca@bluestop.org>; 'Sitsofe Wheeler' <sitsofe@gmail.com>
Cc: 'fio@vger.kernel.org' <fio@vger.kernel.org>; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

I have run next command several times,

perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0

each perf report shows me toppest clock_thread_fn(),

Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235
Overhead  Command  Shared Object       Symbol
  75.27%  fio      fio                 [.] clock_thread_fn
  13.68%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
  10.05%  fio      fio                 [.] fill_random_buf

Samples: 8M of event 'cycles:ppp', Event count (approx.): 28889015047191
Overhead  Command  Shared Object       Symbol
  79.97%  fio      fio                 [.] clock_thread_fn
  11.08%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   8.14%  fio      fio                 [.] fill_random_buf
   0.13%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 57473976919058
Overhead  Command  Shared Object       Symbol
  89.94%  fio      fio                 [.] clock_thread_fn
   5.57%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   4.09%  fio      fio                 [.] fill_random_buf
   0.07%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 160828042961056
Overhead  Command  Shared Object       Symbol                                                                                                               ◆
  92.30%  fio      fio                 [.] clock_thread_fn                                                                                                  ▒
   4.10%  fio      [kernel.kallsyms]   [k] filemap_map_pages                                                                                                ▒
   1.99%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned                                                                                           ▒
   1.46%  fio      fio                 [.] fill_random_buf                                                                                                  ▒
   0.02%  fio      fio                 [.] get_io_u                                                                                                         ▒
   0.02%  fio      fio                 [.] io_u_sync_complete      

Samples: 8M of event 'cycles:ppp', Event count (approx.): 163028870650810
Overhead  Command  Shared Object       Symbol
  87.68%  fio      fio                 [.] clock_thread_fn
   8.77%  fio      [kernel.kallsyms]   [k] run_timer_softirq
   1.96%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   1.44%  fio      fio                 [.] fill_random_buf
   0.02%  fio      fio                 [.] get_io_u
   0.02%  fio      fio                 [.] io_u_sync_complete

 89.62 │       lock   cmpxchg %edi,(%rsi)                                                                                                                   ▒
       │     clock_thread_fn():                                                                                                                             ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │                             if (seq == UINT_MAX)                                                                                                   ▒
       │                                     break;                                                                                                         ▒
       │                             __sync_synchronize();                                                                                                  ▒
       │                             tsc = get_cpu_clock();                                                                                                 ▒
       │                     } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));                                                              ▒
       │       cmp    %eax,%ecx                                                                                                                             ▒
       │     ↓ je     118                                                                                                                                   ◆
       │                     uint32_t seq;                                                                                                                  ▒
       │                     uint64_t tsc;                                                                                                                  ▒
       │                                                                                                                                                    ▒
       │                     c->cpu = t->cpu;                                                                                                               ▒
       │                     do {                                                                                                                           ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │ c9:   mov    0x68(%rbx),%rcx                                                                                                                       ▒
 10.01 │       mov    (%rcx),%ecx        

May you please clarify, is it ok ?, what is the lock contention...

Anton                                                                                                      				  ▒                                         

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine)
Sent: Sunday, December 03, 2017 5:24 AM
To: 'Jens Axboe' <axboe@kernel.dk>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: RE: fio 3.2

Ok, thanks.  Now I built latest fio available :-)

dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
fio-3.2-64-gaf7fb
Starting 16 processes
Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s]
       
       From perf top,
       
Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613
Overhead  Shared Object       Symbol                                                                                                                        ◆
  93.34%  libc-2.22.so        [.] __memcpy_avx_unaligned      

Just can't drill it down,

__memcpy_avx_unaligned  /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@vger.kernel.org 

While other functions can,

get_io_u  /usr/local/bin/fio
  0.08 │        xor    %eax,%esi
  0.01 │        shl    $0x6,%rsi
  0.02 │        and    $0x1c0,%esi
  0.00 │        sub    %rax,%rsi
  0.12 │        add    %rdx,%rsi
  0.00 │        sub    %rax,%rdx
       │        nop
       │                      /*
       │                       * Fill offset into start of cacheline, time into end
       │                       * of cacheline
       │                       */
       │                      iptr = (void *) p + (offset << 6);
       │                      *iptr = boffset;
 16.01 │ 230:   mov    %rax,(%rsi,%rax,1)
       │
       │                      iptr = (void *) p + 64 - 2 * sizeof(uint64_t);
       │                      iptr[0] = io_u->start_time.tv_sec;
  7.03 │        mov    0x0(%rbp),%rcx
 13.52 │        mov    %rcx,0x30(%rdx,%rax,1)                   

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk]
Sent: Thursday, November 30, 2017 11:24 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>; Robert Elliott (Persistent Memory) <elliott@hpe.com>; Rebecca Cran <rebecca@bluestop.org>; Sitsofe Wheeler <sitsofe@gmail.com>
Cc: fio@vger.kernel.org; Kani, Toshimitsu <toshi.kani@hpe.com>
Subject: Re: fio 3.2

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 
> --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5
> --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

--
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-30  4:21                     ` Jens Axboe
  2017-11-30 14:17                       ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-30 16:13                       ` Jens Axboe
  2017-11-30 16:18                         ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2017-11-30 16:13 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory),
	Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 11/29/2017 09:21 PM, Jens Axboe wrote:
> On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote:
>> small_content_scramble has hardly been touched since 2011, so it probably
>> hasn't had much performance analysis.  
> 
> That's fair, would be a good thing to look at, especially since it's on
> by default.

Something like this might be an improvement. The main change here is
that for each 512b chunk in the io_u buffer, we generate a "random"
index between 0..7 and scramble the start and end of that with the
offset and time. The difference here is that we do it all within a 64b
range within each chunk, which should fall into the same cacheline since
the io_u buffer is generally aligned. This cuts down on the number of
cachelines we dirty for each io_u, from a max of 16 to 8.

We could further reduce this to 7, if we generate an overlapping
cacheline between chunks. That would place the data in the same spot
everytime, which isn't ideal though.

I ran a quick null benchmark with this, and on my laptop it brings us
from 4061k/4068k to 4090k/4091k. Those are results from two runs, so not
very conclusive or definitive... Suggestions and tests would be welcome.
FWIW, this is what I ran:

./fio --name=null --size=100g --rw=write --ioengine=null --gtod_reduce=1 --scramble_buffers=1 --iodepth=64 --direct=1 --cpus_allowed=2


diff --git a/io_u.c b/io_u.c
index 086384a1c655..6bb9eabf1cb2 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1669,32 +1669,40 @@ static bool check_get_verify(struct thread_data *td, struct io_u *io_u)
  */
 static void small_content_scramble(struct io_u *io_u)
 {
-	unsigned int i, nr_blocks = io_u->buflen / 512;
+	unsigned int i, nr_blocks = io_u->buflen >> 9;
 	unsigned int offset;
-	uint64_t boffset;
-	char *p, *end;
+	uint64_t boffset, *iptr;
+	char *p;
 
 	if (!nr_blocks)
 		return;
 
 	p = io_u->xfer_buf;
 	boffset = io_u->offset;
-	io_u->buf_filled_len = 0;
+
+	if (io_u->buf_filled_len)
+		io_u->buf_filled_len = 0;
+
+	/*
+	 * Generate random index between 0..7. We do chunks of 512b, if
+	 * we assume a cacheline is 64 bytes, then we have 8 of those.
+	 * Scramble content within the blocks in the same cacheline to
+	 * speed things up.
+	 */
+	offset = (io_u->start_time.tv_nsec ^ boffset) & 7;
 
 	for (i = 0; i < nr_blocks; i++) {
 		/*
-		 * Fill the byte offset into a "random" start offset of
-		 * the first half of the buffer.
+		 * Fill offset into start of cacheline, time into end
+		 * of cacheline
 		 */
-		offset = (io_u->start_time.tv_nsec ^ boffset) & 255;
-		offset &= ~(sizeof(boffset) - 1);
-		memcpy(p + offset, &boffset, sizeof(boffset));
+		iptr = (void *) p + (offset << 6);
+		*iptr = boffset;
+
+		iptr = (void *) p + 64 - 2 * sizeof(uint64_t);
+		iptr[0] = io_u->start_time.tv_sec;
+		iptr[1] = io_u->start_time.tv_nsec;
 
-		/*
-		 * Fill the start time into the end of the buffer
-		 */
-		end = p + 512 - sizeof(io_u->start_time);
-		memcpy(end, &io_u->start_time, sizeof(io_u->start_time));
 		p += 512;
 		boffset += 512;
 	}

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-30 16:13                       ` Jens Axboe
@ 2017-11-30 16:18                         ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2017-11-30 16:18 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory),
	Gavriliuk, Anton (HPS Ukraine), Rebecca Cran, Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Kani, Toshimitsu

On 11/30/2017 09:13 AM, Jens Axboe wrote:
> On 11/29/2017 09:21 PM, Jens Axboe wrote:
>> On 11/28/2017 09:13 PM, Elliott, Robert (Persistent Memory) wrote:
>>> small_content_scramble has hardly been touched since 2011, so it probably
>>> hasn't had much performance analysis.  
>>
>> That's fair, would be a good thing to look at, especially since it's on
>> by default.
> 
> Something like this might be an improvement. The main change here is
> that for each 512b chunk in the io_u buffer, we generate a "random"
> index between 0..7 and scramble the start and end of that with the
> offset and time. The difference here is that we do it all within a 64b
> range within each chunk, which should fall into the same cacheline since
> the io_u buffer is generally aligned. This cuts down on the number of
> cachelines we dirty for each io_u, from a max of 16 to 8.
> 
> We could further reduce this to 7, if we generate an overlapping
> cacheline between chunks. That would place the data in the same spot
> everytime, which isn't ideal though.
> 
> I ran a quick null benchmark with this, and on my laptop it brings us
> from 4061k/4068k to 4090k/4091k. Those are results from two runs, so not
> very conclusive or definitive... Suggestions and tests would be welcome.
> FWIW, this is what I ran:
> 
> ./fio --name=null --size=100g --rw=write --ioengine=null --gtod_reduce=1 --scramble_buffers=1 --iodepth=64 --direct=1 --cpus_allowed=2

Ran the same thing on a server box, which generated very reliable results.
It takes us from 3200K to 3340K. I'm going to commit this, but don't let
that stop anyone from testing it and making suggestions on how we can
further improve it.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

[parent not found: <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>]

* Re: fio 3.2
       [not found]           ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>
@ 2017-11-28  8:38             ` Sitsofe Wheeler
  0 siblings, 0 replies; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-11-28  8:38 UTC (permalink / raw)
  To: Rebecca Cran
  Cc: Gavriliuk, Anton (HPS Ukraine), fio@vger.kernel.org,
	Robert Elliott

Thanks Rebecca.

On 27 November 2017 at 19:57, Rebecca Cran <rebecca@bluestop.org> wrote:
> If you don’t have a pmem device, Linux can emulate one using RAM and the
> ‘memmap’ kernel parameter. See http://pmem.io/2016/02/22/pm-emulation.html
> for more info.
>
> Rebecca
>
> Sent from my iPhone
>
> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>
> Unfortunately I don't have access to a pmem device but let's see how far we
> get:

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: fio 3.2
  2017-11-27 19:38         ` Sitsofe Wheeler
  2017-11-27 22:48           ` Rebecca Cran
       [not found]           ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>
@ 2017-11-28 14:25           ` Gavriliuk, Anton (HPS Ukraine)
  2017-11-29  0:04             ` Sitsofe Wheeler
  2 siblings, 1 reply; 39+ messages in thread
From: Gavriliuk, Anton (HPS Ukraine) @ 2017-11-28 14:25 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: fio@vger.kernel.org, Brian Boylston,
	Elliott, Robert (Persistent Memory)

Ok, now I fixed that.  Seems to be sequence of the keys is important.  I fixed that.  Now I have next issue :-)

I trying to speed up customer’s script.

Initially it was with /dev/pmem & libaio engine,

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --filename=/dev/pmem1 --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=libaio, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 2 (f=2): [_(5),m(1),_(7),m(1),_(2)] [100.0% done] [357.4MB/6990MB/0KB /s] [7066/137K/0 iops] [eta 00m:00s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=23144: Mon Nov 27 19:56:04 2017
  read : io=245977MB, bw=1592.8MB/s, iops=31315, runt=154436msec
    slat (usec): min=1, max=77, avg=10.34, stdev= 6.00
    clat (usec): min=0, max=6, avg= 0.20, stdev= 0.40
     lat (usec): min=1, max=77, avg=10.57, stdev= 6.01
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
     | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[    1]
    bw (KB  /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81
  write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec
    slat (usec): min=1, max=787, avg=16.46, stdev=10.00
    clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41
     lat (usec): min=1, max=789, avg=16.69, stdev=10.00
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
     | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[    1]
    bw (MB  /s): min=  731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41
    lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 250=0.01%, 750=0.01%
  cpu          : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec
  WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec

Now I want saving customer’s I/O pattern, but switch to the DAX access (/dev/dax0.0 raw device) with mmap engine, trying to bypass OS block device & system call layers.  So I modified script,

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0
4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017
  read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec
    clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23
     lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[   10], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
     | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
     | 99.99th=[   53]
    bw (KB  /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13
  write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec
    clat (usec): min=0, max=744, avg=15.16, stdev=10.09
     lat (usec): min=0, max=744, avg=15.21, stdev=10.09
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
     | 99.99th=[   96]
    bw (MB  /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05
    lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58%
    lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec
  WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec

No huge growth detected – lat 16.69 vs 15.21, write bw – 30234Mb/s vs 32121Mb/s.

Looking to the performance traces,

PID 65043  /usr/bin/fio

    HARDCLOCK entries
       Count     Pct  State  Function
        2210  66.29%  USER   __memcpy_avx_unaligned [/lib64/libc-2.22.so]
        1122  33.65%  USER   UNKNOWN
           2   0.06%  SYS    pagerange_is_ram_callback

       Count     Pct  HARDCLOCK Stack trace
       ============================================================
           2   0.06%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned

   25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
   25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6

Seems to be there lot's unaligned memory access, any ideas ?

You could also look at the source code for fio to see what it’s doing or put in some printf() statements so you can understand what the actual alignment is.

Anton

-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@gmail.com] 
Sent: Monday, November 27, 2017 9:38 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com>
Cc: fio@vger.kernel.org; Brian Boylston <brian.boylston@hpe.com>; Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Subject: Re: fio 3.2

Unfortunately I don't have access to a pmem device but let's see how far we get:

On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@hpe.com> wrote:
>
> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32 
> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1 
> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep 
> WRITE)

Please make your problem scenarios as simple as possible:
1. Just run fio normally so we can see the output it produces on both stdout and stderr normally.
2. Reduce the job back so it there are only the bare minimum options that reproduce the problem.
3. Try to avoid changing lots of things.

Here you've switched ioengine introducing another place to look.
Instead how about this:

Was your /dev/dax0.0 device made using ndctl? Assuming yes:

fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

(apparently a size has to be specified when you try to use a character device - see https://nvdimm.wiki.kernel.org/ )

If you run just that by itself what do you see?

Next up:

fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

If you run just that by itself what do you see?

Finally:
Assuming a -o dax mounted filesystem on /pmem0/ :

fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k --eta=never --time_based --runtime=60
--filename=/pmem0/fio-test,4096,1024 --thread=1

If you run just that by itself what do you see?

(Perhaps the documentation for these ioengines and pmem devices needs to be improved?)

--
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: fio 3.2
  2017-11-28 14:25           ` Gavriliuk, Anton (HPS Ukraine)
@ 2017-11-29  0:04             ` Sitsofe Wheeler
  0 siblings, 0 replies; 39+ messages in thread
From: Sitsofe Wheeler @ 2017-11-29  0:04 UTC (permalink / raw)
  To: Gavriliuk, Anton (HPS Ukraine)
  Cc: fio@vger.kernel.org, Elliott, Robert (Persistent Memory)

Anton,

Did you see Elliot's suggestions (i.e. point 4) in his previous email
(http://www.spinics.net/lists/fio/msg06561.html ) for speeding things
up and how you are at the mercy of glibc?

On 28 November 2017 at 14:25, Gavriliuk, Anton (HPS Ukraine)
<anton.gavriliuk@hpe.com> wrote:
> Ok, now I fixed that.  Seems to be sequence of the keys is important.  I fixed that.  Now I have next issue :-)
>
> I trying to speed up customer’s script.
[snip]
>      lat (usec): min=1, max=77, avg=10.57, stdev= 6.01
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
>      | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
>      | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
>      | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
>      | 99.99th=[    1]
>     bw (KB  /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81
>   write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec
>     slat (usec): min=1, max=787, avg=16.46, stdev=10.00
>     clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41
>      lat (usec): min=1, max=789, avg=16.69, stdev=10.00
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
>      | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
>      | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
>      | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
>      | 99.99th=[    1]
>     bw (MB  /s): min=  731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41
>     lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>     lat (usec) : 250=0.01%, 750=0.01%
>   cpu          : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec
>   WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec
>
> Now I want saving customer’s I/O pattern, but switch to the DAX access (/dev/dax0.0 raw device) with mmap engine, trying to bypass OS block device & system call layers.  So I modified script,
>
> dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
> ...
> fio-2.12
> Starting 16 processes
> Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s]
> 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017
>   read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec
>     clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23
>      lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23
>     clat percentiles (usec):
>      |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
>      | 30.00th=[    6], 40.00th=[    8], 50.00th=[   10], 60.00th=[   11],
>      | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
>      | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
>      | 99.99th=[   53]
>     bw (KB  /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13
>   write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec
>     clat (usec): min=0, max=744, avg=15.16, stdev=10.09
>      lat (usec): min=0, max=744, avg=15.21, stdev=10.09
>     clat percentiles (usec):
>      |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
>      | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
>      | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
>      | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
>      | 99.99th=[   96]
>     bw (MB  /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05
>     lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58%
>     lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01%
>   cpu          : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec
>   WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec
>
> No huge growth detected – lat 16.69 vs 15.21, write bw – 30234Mb/s vs 32121Mb/s.
>
> Looking to the performance traces,
>
> PID 65043  /usr/bin/fio
>
>     HARDCLOCK entries
>        Count     Pct  State  Function
>         2210  66.29%  USER   __memcpy_avx_unaligned [/lib64/libc-2.22.so]
>         1122  33.65%  USER   UNKNOWN
>            2   0.06%  SYS    pagerange_is_ram_callback
>
>        Count     Pct  HARDCLOCK Stack trace
>        ============================================================
>            2   0.06%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned
>
>    25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>    25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER  [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>
> Seems to be there lot's unaligned memory access, any ideas ?
>
> You could also look at the source code for fio to see what it’s doing or put in some printf() statements so you can understand what the actual alignment is.
>
> Anton

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2017-12-05 20:29 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <TU4PR8401MB099112614914A9232A07490DEC270@TU4PR8401MB0991.NAMPRD84.PROD.OUTLOOK.COM>
     [not found] ` <903d418b-bac4-104b-28e5-3c529efab7f5@kernel.dk>
2017-11-26  5:30   ` fio 3.2 Gavriliuk, Anton (HPS Ukraine)
2017-11-26 15:28     ` Sitsofe Wheeler
2017-11-27 12:39       ` Gavriliuk, Anton (HPS Ukraine)
2017-11-27 19:38         ` Sitsofe Wheeler
2017-11-27 22:48           ` Rebecca Cran
2017-11-28  1:12             ` Gavriliuk, Anton (HPS Ukraine)
2017-11-28  4:51               ` Elliott, Robert (Persistent Memory)
2017-11-29  3:35                 ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29  4:13                   ` Elliott, Robert (Persistent Memory)
2017-11-29  4:44                     ` Rebecca Cran
2017-11-30  4:04                       ` Jens Axboe
2017-11-29  6:55                     ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29  7:40                     ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29  8:22                       ` Elliott, Robert (Persistent Memory)
2017-11-29 13:12                         ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29 17:17                           ` Sitsofe Wheeler
2017-11-29 20:24                             ` Elliott, Robert (Persistent Memory)
2017-11-30  4:21                     ` Jens Axboe
2017-11-30 14:17                       ` Gavriliuk, Anton (HPS Ukraine)
2017-11-30 21:23                         ` Jens Axboe
2017-12-01  5:21                           ` Gavriliuk, Anton (HPS Ukraine)
2017-12-01  7:15                             ` Robert Elliott (Persistent Memory)
2017-12-03  9:35                               ` Sitsofe Wheeler
2017-12-03 17:10                                 ` Jens Axboe
2017-12-05 20:29                                   ` Elliott, Robert (Persistent Memory)
2017-12-03  3:24                           ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 17:12                           ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 17:30                             ` Robert Elliott (Persistent Memory)
2017-12-04 18:25                               ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 18:50                                 ` Jens Axboe
2017-12-04 19:17                                   ` Jeff Furlong
2017-12-04 19:29                                     ` Gavriliuk, Anton (HPS Ukraine)
2017-12-04 18:41                             ` Jens Axboe
2017-12-04 17:30                           ` Gavriliuk, Anton (HPS Ukraine)
2017-11-30 16:13                       ` Jens Axboe
2017-11-30 16:18                         ` Jens Axboe
     [not found]           ` <F5206176-8A92-46B4-9092-F901C132E22C@bluestop.org>
2017-11-28  8:38             ` Sitsofe Wheeler
2017-11-28 14:25           ` Gavriliuk, Anton (HPS Ukraine)
2017-11-29  0:04             ` Sitsofe Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.