linux-scsi.vger.kernel.org archive mirror
* Reduced latency is killing performance
@ 2016-11-10 16:04 Hannes Reinecke
  2016-11-10 16:36 ` Bart Van Assche
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Hannes Reinecke @ 2016-11-10 16:04 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: SCSI Mailing List, linux-block

[-- Attachment #1: Type: text/plain, Size: 2164 bytes --]

Hi all,

this really feels like a follow-up to the discussion we've had in Santa
Fe, but finally I'm able to substantiate it with some numbers.

I've made a patch to enable the megaraid_sas driver for multiqueue.
While this is pretty straightforward (I'll be sending the patchset later
on), the results are ... interesting.

I've run the 'ssd-test.fio' script from Jens' repository; these are the
results for MQ vs SQ (- is mq, + is sq):

 Run status group 0 (all jobs):
-   READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s,
maxb=181503KB/s, mint=60033msec, maxt=60033msec
+   READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s,
maxb=312572KB/s, mint=60181msec, maxt=60181msec

 Run status group 1 (all jobs):
-   READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s,
mint=60443msec, maxt=60443msec
+   READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s,
mint=60182msec, maxt=60182msec

 Run status group 2 (all jobs):
-  WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s,
maxb=383729KB/s, mint=60001msec, maxt=60001msec
+  WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s,
maxb=807581KB/s, mint=60129msec, maxt=60129msec

 Run status group 3 (all jobs):
-  WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s,
mint=60399msec, maxt=60399msec
+  WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s,
mint=60207msec, maxt=60207msec

 Disk stats (read/write):
-  sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836,
in_queue=135345876, util=99.20%
+  sda: ios=205278/2680329, merge=4552593/9580622,
ticks=12539912/12965228, in_queue=25512312, util=99.59%

As you can see, we're really losing performance in the multiqueue case.
The main reason is that we end up sending about _10 times_ as many
I/O requests to the disk as we do in the single-queue case.
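
A rough sanity check of that ratio, re-derived from the io= totals and
the ios/merge figures quoted above (just a sketch in Python; MB is taken
as MiB, and the exact factor differs between reads and writes):

# Average request size at the disk and request-count ratio, using only
# the numbers quoted above.
KIB_PER_MB = 1024

cases = {
    "mq": {"read_ios": 2834412, "write_ios": 5878578,
           "read_kib": 10641 * KIB_PER_MB + 441444,
           "write_kib": 22485 * KIB_PER_MB + 489852},
    "sq": {"read_ios": 205278, "write_ios": 2680329,
           "read_kib": 18370 * KIB_PER_MB + 223108,
           "write_kib": 47421 * KIB_PER_MB + 489748},
}

for name, c in cases.items():
    print("%s: avg read request %.1f KiB, avg write request %.1f KiB"
          % (name, c["read_kib"] / c["read_ios"],
             c["write_kib"] / c["write_ios"]))

print("read  requests mq/sq: %.1fx"
      % (cases["mq"]["read_ios"] / cases["sq"]["read_ios"]))
print("write requests mq/sq: %.1fx"
      % (cases["mq"]["write_ios"] / cases["sq"]["write_ios"]))
# Roughly 14x more read requests and 2x more write requests under mq,
# at about 4 KiB per request instead of roughly 93 KiB / 18 KiB.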

So I guess having an I/O scheduler is critical, even for the scsi-mq case.

Full fio output attached.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

[-- Attachment #2: ssd-test.fio --]
[-- Type: text/plain, Size: 857 bytes --]

# Do some important numbers on SSD drives, to gauge what kind of
# performance you might get out of them.
#
# Sequential read and write speeds are tested; these are expected to be
# high. Random reads should also be fast; random writes are where crap
# drives are usually separated from the good drives.
#
# The original script drives a queue depth of 4, which is closer to what
# most real-life usage will see; new SATA SSDs support up to 32 in-flight
# commands, so it can be interesting to raise the depth and compare.  As
# adapted here, the [global] section below uses iodepth=256 across 16
# jobs (see the quick calculation after the job file).
#
[global]
ioengine=libaio
iodepth=256
size=10g
direct=1
runtime=60
numjobs=16
directory=/home
filename=ssd.test.file
group_reporting

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
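
A quick calculation of the depth these settings actually drive (a
sketch; it assumes fio keeps the queues full, which the depth
distributions in the logs suggest it does):

# Outstanding I/O per run group with the [global] settings above.
iodepth, numjobs, bs_kib = 256, 16, 4
outstanding = iodepth * numjobs
print("%d requests in flight, %.0f MiB of queued 4 KiB I/O per group"
      % (outstanding, outstanding * bs_kib / 1024))
# -> 4096 requests in flight, 16 MiB of queued 4 KiB I/O per group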


[-- Attachment #3: megasas-sq.log --]
[-- Type: text/x-log; name="megasas-sq.log", Size: 6400 bytes --]

seq-read: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
fio-2.15
Starting 64 processes

seq-read: (groupid=0, jobs=16): err= 0: pid=3715: Thu Nov 10 16:48:49 2016
  read : io=18370MB, bw=312573KB/s, iops=78143, runt= 60181msec
    slat (usec): min=1, max=5664, avg= 2.33, stdev= 5.74
    clat (usec): min=30, max=1283.3K, avg=52320.04, stdev=114862.40
     lat (usec): min=37, max=1283.3K, avg=52322.47, stdev=114862.35
    clat percentiles (usec):
     |  1.00th=[  620],  5.00th=[ 1432], 10.00th=[ 2096], 20.00th=[ 2544],
     | 30.00th=[ 2608], 40.00th=[ 2640], 50.00th=[ 2672], 60.00th=[ 2736],
     | 70.00th=[ 2864], 80.00th=[14912], 90.00th=[284672], 95.00th=[342016],
     | 99.00th=[419840], 99.50th=[468992], 99.90th=[585728], 99.95th=[634880],
     | 99.99th=[749568]
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.20%, 500=0.47%, 750=1.07%
    lat (usec) : 1000=0.90%
    lat (msec) : 2=5.92%, 4=63.81%, 10=5.22%, 20=3.28%, 50=1.84%
    lat (msec) : 100=1.19%, 250=4.25%, 500=11.54%, 750=0.28%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.65%, sys=1.42%, ctx=78762, majf=0, minf=9372
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=4702738/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
rand-read: (groupid=1, jobs=16): err= 0: pid=3772: Thu Nov 10 16:48:49 2016
  read : io=223108KB, bw=3707.3KB/s, iops=926, runt= 60182msec
    slat (usec): min=2, max=1131.3K, avg=17217.00, stdev=95963.21
    clat (msec): min=16, max=5492, avg=4255.60, stdev=671.40
     lat (msec): min=171, max=6020, avg=4272.81, stdev=675.11
    clat percentiles (msec):
     |  1.00th=[  717],  5.00th=[ 3195], 10.00th=[ 4080], 20.00th=[ 4228],
     | 30.00th=[ 4359], 40.00th=[ 4359], 50.00th=[ 4424], 60.00th=[ 4424],
     | 70.00th=[ 4490], 80.00th=[ 4490], 90.00th=[ 4555], 95.00th=[ 4752],
     | 99.00th=[ 4817], 99.50th=[ 4948], 99.90th=[ 5473], 99.95th=[ 5473],
     | 99.99th=[ 5473]
    lat (msec) : 20=0.01%, 100=0.02%, 250=0.24%, 500=0.40%, 750=0.40%
    lat (msec) : 1000=0.46%, 2000=1.61%, >=2000=96.87%
  cpu          : usr=0.02%, sys=0.04%, ctx=1929, majf=0, minf=4259
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.5%, 32=0.9%, >=64=98.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=55777/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
seq-write: (groupid=2, jobs=16): err= 0: pid=3790: Thu Nov 10 16:48:49 2016
  write: io=47421MB, bw=807582KB/s, iops=201895, runt= 60129msec
    slat (usec): min=2, max=3039, avg= 4.85, stdev=16.02
    clat (usec): min=95, max=794322, avg=20251.85, stdev=67558.84
     lat (usec): min=123, max=794326, avg=20256.83, stdev=67559.50
    clat percentiles (usec):
     |  1.00th=[  900],  5.00th=[  932], 10.00th=[  956], 20.00th=[  996],
     | 30.00th=[ 1048], 40.00th=[ 1112], 50.00th=[ 1160], 60.00th=[ 1224],
     | 70.00th=[ 1400], 80.00th=[ 1720], 90.00th=[ 2960], 95.00th=[232448],
     | 99.00th=[296960], 99.50th=[313344], 99.90th=[366592], 99.95th=[489472],
     | 99.99th=[774144]
    lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.09%, 1000=20.47%
    lat (msec) : 2=64.34%, 4=5.85%, 10=0.24%, 20=0.33%, 50=0.81%
    lat (msec) : 100=0.54%, 250=3.00%, 500=4.26%, 750=0.02%, 1000=0.03%
  cpu          : usr=1.41%, sys=5.44%, ctx=558403, majf=0, minf=12550
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=0/w=12139774/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
rand-write: (groupid=3, jobs=16): err= 0: pid=3806: Thu Nov 10 16:48:49 2016
  write: io=489748KB, bw=8134.5KB/s, iops=2033, runt= 60207msec
    slat (usec): min=3, max=330352, avg=7843.31, stdev=43990.65
    clat (msec): min=65, max=3671, avg=2001.23, stdev=448.67
     lat (msec): min=68, max=3671, avg=2009.07, stdev=449.18
    clat percentiles (msec):
     |  1.00th=[  285],  5.00th=[  693], 10.00th=[ 1680], 20.00th=[ 1991],
     | 30.00th=[ 2040], 40.00th=[ 2089], 50.00th=[ 2114], 60.00th=[ 2147],
     | 70.00th=[ 2180], 80.00th=[ 2245], 90.00th=[ 2343], 95.00th=[ 2376],
     | 99.00th=[ 2540], 99.50th=[ 2606], 99.90th=[ 2704], 99.95th=[ 2737],
     | 99.99th=[ 2868]
    lat (msec) : 100=0.08%, 250=0.68%, 500=1.72%, 750=3.60%, 1000=1.28%
    lat (msec) : 2000=14.04%, >=2000=78.60%
  cpu          : usr=0.03%, sys=0.14%, ctx=11009, majf=0, minf=172
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=0/w=122437/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s, maxb=312572KB/s, mint=60181msec, maxt=60181msec

Run status group 1 (all jobs):
   READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s, mint=60182msec, maxt=60182msec

Run status group 2 (all jobs):
  WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s, maxb=807581KB/s, mint=60129msec, maxt=60129msec

Run status group 3 (all jobs):
  WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s, mint=60207msec, maxt=60207msec

Disk stats (read/write):
  sda: ios=205278/2680329, merge=4552593/9580622, ticks=12539912/12965228, in_queue=25512312, util=99.59%

[-- Attachment #4: megasas-mq.log --]
[-- Type: text/x-log; name="megasas-mq.log", Size: 6429 bytes --]

seq-read: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=256
...
fio-2.15
Starting 64 processes

seq-read: (groupid=0, jobs=16): err= 0: pid=12365: Thu Nov 10 16:34:03 2016
  read : io=10641MB, bw=181503KB/s, iops=45375, runt= 60033msec
    slat (usec): min=2, max=168902, avg=349.71, stdev=2179.65
    clat (usec): min=27, max=1046.3K, avg=89851.03, stdev=94205.19
     lat (usec): min=921, max=1046.3K, avg=90200.97, stdev=94519.88
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[   11], 10.00th=[   16], 20.00th=[   23],
     | 30.00th=[   32], 40.00th=[   43], 50.00th=[   57], 60.00th=[   77],
     | 70.00th=[  101], 80.00th=[  139], 90.00th=[  210], 95.00th=[  289],
     | 99.00th=[  445], 99.50th=[  506], 99.90th=[  644], 99.95th=[  709],
     | 99.99th=[  881]
    lat (usec) : 50=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.22%, 10=3.90%, 20=12.47%, 50=29.05%
    lat (msec) : 100=23.95%, 250=23.49%, 500=6.38%, 750=0.51%, 1000=0.03%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.38%, sys=1.53%, ctx=1261134, majf=0, minf=7465
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2724047/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
rand-read: (groupid=1, jobs=16): err= 0: pid=12424: Thu Nov 10 16:34:03 2016
  read : io=441444KB, bw=7303.5KB/s, iops=1825, runt= 60443msec
    slat (usec): min=3, max=5654.6K, avg=8705.16, stdev=98890.89
    clat (usec): min=499, max=54071K, avg=2020668.34, stdev=4066596.80
     lat (usec): min=534, max=54071K, avg=2029373.76, stdev=4080684.63
    clat percentiles (msec):
     |  1.00th=[   58],  5.00th=[   96], 10.00th=[  120], 20.00th=[  163],
     | 30.00th=[  219], 40.00th=[  310], 50.00th=[  482], 60.00th=[  881],
     | 70.00th=[ 1598], 80.00th=[ 2868], 90.00th=[ 5604], 95.00th=[ 8455],
     | 99.00th=[16712], 99.50th=[16712], 99.90th=[16712], 99.95th=[16712],
     | 99.99th=[16712]
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.08%, 50=0.47%
    lat (msec) : 100=5.07%, 250=28.49%, 500=16.49%, 750=6.91%, 1000=4.65%
    lat (msec) : 2000=11.35%, >=2000=26.44%
  cpu          : usr=0.03%, sys=0.10%, ctx=12943, majf=0, minf=4261
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.5%, >=64=99.1%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=110361/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
seq-write: (groupid=2, jobs=16): err= 0: pid=12440: Thu Nov 10 16:34:03 2016
  write: io=22485MB, bw=383729KB/s, iops=95932, runt= 60001msec
    slat (usec): min=4, max=25089, avg=160.99, stdev=186.82
    clat (usec): min=9, max=99208, avg=42511.29, stdev=8094.51
     lat (usec): min=285, max=99223, avg=42672.98, stdev=8120.87
    clat percentiles (usec):
     |  1.00th=[27008],  5.00th=[35072], 10.00th=[35584], 20.00th=[37120],
     | 30.00th=[37632], 40.00th=[38656], 50.00th=[40192], 60.00th=[41728],
     | 70.00th=[43776], 80.00th=[47872], 90.00th=[54528], 95.00th=[60160],
     | 99.00th=[67072], 99.50th=[70144], 99.90th=[76288], 99.95th=[78336],
     | 99.99th=[84480]
    lat (usec) : 10=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.33%, 50=83.21%
    lat (msec) : 100=16.42%
  cpu          : usr=3.17%, sys=19.97%, ctx=10291638, majf=0, minf=28791
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=0/w=5756037/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256
rand-write: (groupid=3, jobs=16): err= 0: pid=12456: Thu Nov 10 16:34:03 2016
  write: io=489852KB, bw=8110.3KB/s, iops=2027, runt= 60399msec
    slat (usec): min=4, max=7321.2K, avg=7849.80, stdev=103311.36
    clat (msec): min=30, max=49117, avg=2007.24, stdev=3917.47
     lat (msec): min=30, max=49143, avg=2015.09, stdev=3929.12
    clat percentiles (msec):
     |  1.00th=[   73],  5.00th=[  120], 10.00th=[  147], 20.00th=[  194],
     | 30.00th=[  253], 40.00th=[  322], 50.00th=[  469], 60.00th=[  766],
     | 70.00th=[ 1352], 80.00th=[ 2704], 90.00th=[ 5473], 95.00th=[ 9372],
     | 99.00th=[16712], 99.50th=[16712], 99.90th=[16712], 99.95th=[16712],
     | 99.99th=[16712]
    lat (msec) : 50=0.18%, 100=2.31%, 250=27.05%, 500=22.01%, 750=8.03%
    lat (msec) : 1000=4.76%, 2000=11.62%, >=2000=24.04%
  cpu          : usr=0.05%, sys=0.16%, ctx=23792, majf=0, minf=181
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=0/w=122463/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s, maxb=181503KB/s, mint=60033msec, maxt=60033msec

Run status group 1 (all jobs):
   READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s, mint=60443msec, maxt=60443msec

Run status group 2 (all jobs):
  WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s, maxb=383729KB/s, mint=60001msec, maxt=60001msec

Run status group 3 (all jobs):
  WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s, mint=60399msec, maxt=60399msec

Disk stats (read/write):
  sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836, in_queue=135345876, util=99.20%


* Re: Reduced latency is killing performance
  2016-11-10 16:04 Reduced latency is killing performance Hannes Reinecke
@ 2016-11-10 16:36 ` Bart Van Assche
  2016-11-10 16:48   ` Jens Axboe
  2016-11-10 16:47 ` Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2016-11-10 16:36 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke, Christoph Hellwig
  Cc: SCSI Mailing List, linux-block, Damien Le Moal

On 11/10/2016 08:04 AM, Hannes Reinecke wrote:
> this really feels like a follow-up to the discussion we've had in Santa
> Fe, but finally I'm able to substantiate it with some numbers.

Hi Jens,

Should I send you the notes I took on Thursday morning, November 3, at
10 AM so that you can publish them?

Thanks,

Bart.


* Re: Reduced latency is killing performance
  2016-11-10 16:04 Reduced latency is killing performance Hannes Reinecke
  2016-11-10 16:36 ` Bart Van Assche
@ 2016-11-10 16:47 ` Jens Axboe
  2016-11-11 10:26   ` Kashyap Desai
  2016-11-11  7:02 ` Elliott, Robert (Persistent Memory)
  2016-11-11 17:15 ` Bart Van Assche
  3 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2016-11-10 16:47 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: SCSI Mailing List, linux-block

On 11/10/2016 09:04 AM, Hannes Reinecke wrote:
> Hi all,
>
> this really feels like a follow-up to the discussion we've had in Santa
> Fe, but finally I'm able to substantiate it with some numbers.
>
> I've made a patch to enable the megaraid_sas driver for multiqueue.
> While this is pretty straightforward (I'll be sending the patchset later
> on), the results are ... interesting.
>
> I've run the 'ssd-test.fio' script from Jens' repository, and these
> results for MQ/SQ (- is mq, + is sq):
>
>  Run status group 0 (all jobs):
> -   READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s,
> maxb=181503KB/s, mint=60033msec, maxt=60033msec
> +   READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s,
> maxb=312572KB/s, mint=60181msec, maxt=60181msec
>
>  Run status group 1 (all jobs):
> -   READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s,
> mint=60443msec, maxt=60443msec
> +   READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s,
> mint=60182msec, maxt=60182msec
>
>  Run status group 2 (all jobs):
> -  WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s,
> maxb=383729KB/s, mint=60001msec, maxt=60001msec
> +  WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s,
> maxb=807581KB/s, mint=60129msec, maxt=60129msec
>
>  Run status group 3 (all jobs):
> -  WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s,
> mint=60399msec, maxt=60399msec
> +  WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s,
> mint=60207msec, maxt=60207msec
>
>  Disk stats (read/write):
> -  sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836,
> in_queue=135345876, util=99.20%
> +  sda: ios=205278/2680329, merge=4552593/9580622,
> ticks=12539912/12965228, in_queue=25512312, util=99.59%
>
> As you can see, we're really losing performance in the multiqueue case.
> And the main reason for that is that we submit about _10 times_ as much
> I/O as we do for the single-queue case.

What's the setup like? I'm going to need more details.

The baseline test is using the legacy path, single queue. The new test
is multiqueue, scsi-mq. What's sda?

> So I guess having an I/O scheduler is critical, even for the scsi-mq case.

Each of these sections is a single job. For some reason we are not
merging as well as we should, that's the reason for the performance
loss. In fact, we're not merging at all. That's not IO scheduling.

-- 
Jens Axboe



* Re: Reduced latency is killing performance
  2016-11-10 16:36 ` Bart Van Assche
@ 2016-11-10 16:48   ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2016-11-10 16:48 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, Christoph Hellwig
  Cc: SCSI Mailing List, linux-block, Damien Le Moal

On 11/10/2016 09:36 AM, Bart Van Assche wrote:
> On 11/10/2016 08:04 AM, Hannes Reinecke wrote:
>> this really feels like a follow-up to the discussion we've had in Santa
>> Fe, but finally I'm able to substantiate it with some numbers.
>
> Hi Jens,
>
> Should I send you the notes I took on Thursday morning, November 3, at
> 10 AM so that you can publish them?

I don't think it's related to this thread, but yes, that would be great.

-- 
Jens Axboe



* RE: Reduced latency is killing performance
  2016-11-10 16:04 Reduced latency is killing performance Hannes Reinecke
  2016-11-10 16:36 ` Bart Van Assche
  2016-11-10 16:47 ` Jens Axboe
@ 2016-11-11  7:02 ` Elliott, Robert (Persistent Memory)
  2016-11-11 11:39   ` Hannes Reinecke
  2016-11-11 17:15 ` Bart Van Assche
  3 siblings, 1 reply; 8+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2016-11-11  7:02 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe, Christoph Hellwig
  Cc: SCSI Mailing List, linux-block@vger.kernel.org



> -----Original Message-----
> From: linux-block-owner@vger.kernel.org [mailto:linux-block-
> owner@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Thursday, November 10, 2016 10:05 AM
> To: Jens Axboe <axboe@kernel.dk>; Christoph Hellwig <hch@lst.de>
> Cc: SCSI Mailing List <linux-scsi@vger.kernel.org>; linux-
> block@vger.kernel.org
> Subject: Reduced latency is killing performance
> 
> Hi all,
> 
> this really feels like a follow-up to the discussion we've had in
> Santa Fe, but finally I'm able to substantiate it with some numbers.
> 
> I've made a patch to enable the megaraid_sas driver for multiqueue.
> While this is pretty straightforward (I'll be sending the patchset
> later on), the results are ... interesting.
> 
> I've run the 'ssd-test.fio' script from Jens' repository, and these
> results for MQ/SQ (- is mq, + is sq):
> 
>  Run status group 0 (all jobs):  [4 KiB sequential reads]
> -   READ: io=10641MB, aggrb=181503KB/s
> +   READ: io=18370MB, aggrb=312572KB/s
> 
>  Run status group 1 (all jobs):  [4 KiB random reads]
> -   READ: io=441444KB, aggrb=7303KB/s
> +   READ: io=223108KB, aggrb=3707KB/s
> 
>  Run status group 2 (all jobs):  [4 KiB sequential writes]
> -  WRITE: io=22485MB, aggrb=383729KB/s
> +  WRITE: io=47421MB, aggrb=807581KB/s
> 
>  Run status group 3 (all jobs):  [4 KiB random writes]
> -  WRITE: io=489852KB, aggrb=8110KB/s
> +  WRITE: io=489748KB, aggrb=8134KB/s
> 
>  Disk stats (read/write):
> -  sda: ios=2834412/5878578, merge=0/0
> +  sda: ios=205278/2680329, merge=4552593/9580622

[deleted minb, maxb, mint, maxt, ticks, in_queue, and util above]

> 
> As you can see, we're really losing performance in the multiqueue
> case.
> And the main reason for that is that we submit about _10 times_ as
> much I/O as we do for the single-queue case.

That script is running:
0) 4 KiB sequential reads
1) 4 KiB random reads
2) 4 KiB sequential writes
3) 4 KiB random writes

I think you're just seeing a lack of merges for the tiny sequential
workloads.  Those are the ones where mq has lower aggrb results.

Check the value in /sys/block/sda/queue/nomerges. The values are
    0=search for fast and slower merges
    1=only attempt fast merges
    2=don't attempt any merges
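
A minimal sketch to check both the policy and the merge counters on the
test box (it assumes the disk under test really is sda, as in the stats
above; the counters in /sys/block/sda/stat are cumulative since boot):

from pathlib import Path

dev = Path("/sys/block/sda")
nomerges = int((dev / "queue/nomerges").read_text())
f = (dev / "stat").read_text().split()
print("nomerges =", nomerges)
print("reads : %s sent to disk, %s merged" % (f[0], f[1]))
print("writes: %s sent to disk, %s merged" % (f[4], f[5]))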

The SNIA Enterprise Solid State Storage Performance Test Specification
(SSS PTS-E) only measures 128 KiB and 1 MiB sequential IOs - it doesn't
test tiny sequential IOs.  Applications may do anything, but I think
most understand that fewer, bigger transfers are more efficient 
throughout the IO stack.  A blocksize of 128 KiB would reduce those
IOs by 96%.
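
Just to show where that figure comes from (a trivial sketch):

# Same amount of data moved as 128 KiB requests instead of 4 KiB ones.
small_kib, large_kib = 4, 128
print("%dx fewer requests, a %.1f%% reduction"
      % (large_kib // small_kib, 100 * (1 - small_kib / large_kib)))
# -> 32x fewer requests, a 96.9% reduction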

For hpsa, we often turned merges off to avoid the overhead while running
applications generating decent-sized IOs on their own.

Note that the random read aggrb value doubled with mq, and random
writes showed no impact.

You might also want to set 
    cpus_allowed_policy=split
to keep threads from wandering across CPUs (and thus changing queues).

> So I guess having an I/O scheduler is critical, even for the scsi-mq
> case.

blk-mq still supports merges without any scheduler.

---
Robert Elliott, HPE Persistent Memory





* RE: Reduced latency is killing performance
  2016-11-10 16:47 ` Jens Axboe
@ 2016-11-11 10:26   ` Kashyap Desai
  0 siblings, 0 replies; 8+ messages in thread
From: Kashyap Desai @ 2016-11-11 10:26 UTC (permalink / raw)
  To: Jens Axboe, Hannes Reinecke, Christoph Hellwig
  Cc: SCSI Mailing List, linux-block

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Jens Axboe
> Sent: Thursday, November 10, 2016 10:18 PM
> To: Hannes Reinecke; Christoph Hellwig
> Cc: SCSI Mailing List; linux-block@vger.kernel.org
> Subject: Re: Reduced latency is killing performance
>
> On 11/10/2016 09:04 AM, Hannes Reinecke wrote:
> > Hi all,
> >
> > this really feels like a follow-up to the discussion we've had in
> > Santa Fe, but finally I'm able to substantiate it with some numbers.
> >
> > I've made a patch to enable the megaraid_sas driver for multiqueue.
> > While this is pretty straightforward (I'll be sending the patchset
> > later on), the results are ... interesting.
> >
> > I've run the 'ssd-test.fio' script from Jens' repository, and these
> > results for MQ/SQ (- is mq, + is sq):
> >
> >  Run status group 0 (all jobs):
> > -   READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s,
> > maxb=181503KB/s, mint=60033msec, maxt=60033msec
> > +   READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s,
> > maxb=312572KB/s, mint=60181msec, maxt=60181msec
> >
> >  Run status group 1 (all jobs):
> > -   READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s,
> > mint=60443msec, maxt=60443msec
> > +   READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s,
> > mint=60182msec, maxt=60182msec
> >
> >  Run status group 2 (all jobs):
> > -  WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s,
> > maxb=383729KB/s, mint=60001msec, maxt=60001msec
> > +  WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s,
> > maxb=807581KB/s, mint=60129msec, maxt=60129msec
> >
> >  Run status group 3 (all jobs):
> > -  WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s,
> > mint=60399msec, maxt=60399msec
> > +  WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s,
> > mint=60207msec, maxt=60207msec
> >
> >  Disk stats (read/write):
> > -  sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836,
> > in_queue=135345876, util=99.20%
> > +  sda: ios=205278/2680329, merge=4552593/9580622,
> > ticks=12539912/12965228, in_queue=25512312, util=99.59%
> >
> > As you can see, we're really losing performance in the multiqueue case.
> > And the main reason for that is that we submit about _10 times_ as
> > much I/O as we do for the single-queue case.
>
> What's the setup like? I'm going to need more details.
>
> The baseline test is using the legacy path, single queue. The new test is
> multiqueue, scsi-mq. What's sda?

Hannes -

Please share setup/config details so that I can also validate and post my
findings.

` Kashyap

>
> > So I guess having an I/O scheduler is critical, even for the scsi-mq
> > case.
>
> Each of these sections is a single job. For some reason we are not
> merging as well as we should, that's the reason for the performance
> loss. In fact, we're not merging at all. That's not IO scheduling.
>
> --
> Jens Axboe
>


* Re: Reduced latency is killing performance
  2016-11-11  7:02 ` Elliott, Robert (Persistent Memory)
@ 2016-11-11 11:39   ` Hannes Reinecke
  0 siblings, 0 replies; 8+ messages in thread
From: Hannes Reinecke @ 2016-11-11 11:39 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Jens Axboe,
	Christoph Hellwig
  Cc: SCSI Mailing List, linux-block@vger.kernel.org

On 11/11/2016 08:02 AM, Elliott, Robert (Persistent Memory) wrote:
>
>
>> -----Original Message-----
>> From: linux-block-owner@vger.kernel.org [mailto:linux-block-
>> owner@vger.kernel.org] On Behalf Of Hannes Reinecke
>> Sent: Thursday, November 10, 2016 10:05 AM
>> To: Jens Axboe <axboe@kernel.dk>; Christoph Hellwig <hch@lst.de>
>> Cc: SCSI Mailing List <linux-scsi@vger.kernel.org>; linux-
>> block@vger.kernel.org
>> Subject: Reduced latency is killing performance
>>
>> Hi all,
>>
>> this really feels like a follow-up to the discussion we've had in
>> Santa Fe, but finally I'm able to substantiate it with some numbers.
>>
>> I've made a patch to enable the megaraid_sas driver for multiqueue.
>> While this is pretty straightforward (I'll be sending the patchset
>> later on), the results are ... interesting.
>>
>> I've run the 'ssd-test.fio' script from Jens' repository, and these
>> results for MQ/SQ (- is mq, + is sq):
>>
>>  Run status group 0 (all jobs):  [4 KiB sequential reads]
>> -   READ: io=10641MB, aggrb=181503KB/s
>> +   READ: io=18370MB, aggrb=312572KB/s
>>
>>  Run status group 1 (all jobs):  [4 KiB random reads]
>> -   READ: io=441444KB, aggrb=7303KB/s
>> +   READ: io=223108KB, aggrb=3707KB/s
>>
>>  Run status group 2 (all jobs):  [4 KiB sequential writes]
>> -  WRITE: io=22485MB, aggrb=383729KB/s
>> +  WRITE: io=47421MB, aggrb=807581KB/s
>>
>>  Run status group 3 (all jobs):  [4 KiB random writes]
>> -  WRITE: io=489852KB, aggrb=8110KB/s
>> +  WRITE: io=489748KB, aggrb=8134KB/s
>>
>>  Disk stats (read/write):
>> -  sda: ios=2834412/5878578, merge=0/0
>> +  sda: ios=205278/2680329, merge=4552593/9580622
>
> [deleted minb, maxb, mint, maxt, ticks, in_queue, and util above]
>
>>
>> As you can see, we're really losing performance in the multiqueue
>> case.
>> And the main reason for that is that we submit about _10 times_ as
>> much I/O as we do for the single-queue case.
>
> That script is running:
> 0) 4 KiB sequential reads
> 1) 4 KiB random reads
> 2) 4 KiB sequential writes
> 3) 4 KiB random writes
>
> I think you're just seeing a lack of merges for the tiny sequential
> workloads.  Those are the ones where mq has lower aggrb results.
>
Yep.

> Check the value in /sys/block/sda/queue/nomerges. The values are
>     0=search for fast and slower merges
>     1=only attempt fast merges
>     2=don't attempt any merges
>
It's set to '0'.

> The SNIA Enterprise Solid State Storage Performance Test Specification
> (SSS PTS-E) only measures 128 KiB and 1 MiB sequential IOs - it doesn't
> test tiny sequential IOs.  Applications may do anything, but I think
> most understand that fewer, bigger transfers are more efficient
> throughout the IO stack.  A blocksize of 128 KiB would reduce those
> IOs by 96%.
>
Note: it's just the test that is named 'SSD'; the devices themselves
are not SSDs, just normal disks.

> For hpsa, we often turned merges off to avoid the overhead while running
> applications generating decent-sized IOs on their own.
>
> Note that the random read aggrb value doubled with mq, and random
> writes showed no impact.
>
> You might also want to set
>     cpus_allowed_policy=split
> to keep threads from wandering across CPUs (and thus changing queues).
>
Done so; no difference.

>> So I guess having an I/O scheduler is critical, even for the scsi-mq
>> case.
>
> blk-mq still supports merges without any scheduler.
>
But it doesn't _do_ merging, as the example nicely shows.

So if we could get merging going we should be halfway there ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: Reduced latency is killing performance
  2016-11-10 16:04 Reduced latency is killing performance Hannes Reinecke
                   ` (2 preceding siblings ...)
  2016-11-11  7:02 ` Elliott, Robert (Persistent Memory)
@ 2016-11-11 17:15 ` Bart Van Assche
  3 siblings, 0 replies; 8+ messages in thread
From: Bart Van Assche @ 2016-11-11 17:15 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe, Christoph Hellwig
  Cc: SCSI Mailing List, linux-block

On 11/10/2016 08:04 AM, Hannes Reinecke wrote:
> this really feels like a follow-up to the discussion we've had in Santa
> Fe, but finally I'm able to substantiate it with some numbers.

Hello Hannes,

Since, for the sequential read test, the difference between the 50th
percentile completion time (2.7 ms) and the average completion time
(52 ms) is large, I assume that you ran the ssd-test.fio job on top of a
hard disk or a hard disk array? How many hardware queues
were active during your test? If multiple hardware queues were active, 
can you repeat your test with a single hardware queue? The plan for 
adding I/O scheduling support to blk-mq is - at least initially - only 
to perform I/O scheduling if there is only a single hardware queue.
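
For reference, a minimal sketch to answer the hardware-queue question on
the test machine (it assumes scsi-mq is active and the device under test
is sda):

from pathlib import Path

mq = Path("/sys/block/sda/mq")
if mq.is_dir():
    hctx = sorted(int(p.name) for p in mq.iterdir()
                  if p.is_dir() and p.name.isdigit())
    print("%d hardware queue(s): %s" % (len(hctx), hctx))
else:
    print("no mq directory - the device is on the legacy single-queue path")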

Bart.


end of thread, other threads:[~2016-11-11 17:15 UTC | newest]

Thread overview: 8+ messages
2016-11-10 16:04 Reduced latency is killing performance Hannes Reinecke
2016-11-10 16:36 ` Bart Van Assche
2016-11-10 16:48   ` Jens Axboe
2016-11-10 16:47 ` Jens Axboe
2016-11-11 10:26   ` Kashyap Desai
2016-11-11  7:02 ` Elliott, Robert (Persistent Memory)
2016-11-11 11:39   ` Hannes Reinecke
2016-11-11 17:15 ` Bart Van Assche
