From: Ted Ts'o <tytso@mit.edu>
To: Xupeng Yun <xupeng@xupeng.me>
Cc: Ext4 development <linux-ext4@vger.kernel.org>
Subject: Re: Bad performance of ext4 with kernel 3.0.17
Date: Thu, 1 Mar 2012 14:47:35 -0500
Message-ID: <20120301194735.GD32588@thunk.org>
In-Reply-To: <CACaf2aaqxM86DtdZMaaQZfrC+WbLwPjOVd=LmVjk+TvfObYUzQ@mail.gmail.com>

Two things I'd try:

#1) If this is a freshly created file system, the kernel may be
initializing the inode table in the background, and this could be
interfering with your benchmark workload.  To address this, you can
either (a) add the mount option noinit_itable, (b) add the mke2fs
option "-E lazy_itable_init=0", although this will make mke2fs take
much longer, or (c) mount the file system and wait until "dumpe2fs
/dev/md3 | tail" shows that the last block group has the ITABLE_ZEROED
flag set.  For benchmarking purposes on a scratch workload, option (a)
is the fastest thing to do (see the sketch below).
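
Concretely, something along these lines.  This is only a sketch,
reusing the mkfs and mount flags from your message; note that (b)
re-runs mke2fs and so destroys the existing file system:

    # (a) skip the background inode table initialization for this run
    mount /dev/md3 /mnt/test \
        -o noatime,data=writeback,nobarrier,noinit_itable

    # (b) zero the inode tables at mkfs time instead (slower mkfs)
    mkfs.ext4 -E stride=16,stripe-width=32,lazy_itable_init=0 /dev/md3

    # (c) or mount normally and poll until the last block group
    #     shows the ITABLE_ZEROED flag
    dumpe2fs /dev/md3 | tail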

#2) It could be that the file system is allocating blocks farther from
the beginning of the disk, which is slower, whereas fio run against
the raw device uses the blocks closest to the beginning of the disk,
which are the fastest ones.  You could try creating the file system so
it is only 10GB, then run fio on that small, truncated file system,
and see if that makes a difference (a rough sketch follows).
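
Something like the following; again just a sketch, assuming your
original mkfs and mount flags.  mke2fs accepts an optional block count
after the device, and 10GB at the default 4k block size is 2621440
blocks:

    # build a 10GB file system at the start of the array
    mkfs.ext4 -E stride=16,stripe-width=32 /dev/md3 2621440
    mount /dev/md3 /mnt/test -o noatime,data=writeback,nobarrier

    # repeat the same fio run against the small file system
    fio --filename=/mnt/test/test --direct=1 --rw=randrw --bs=16k \
        --size=5G --numjobs=16 --runtime=60 --group_reporting \
        --name=file1 --rwmixread=90 --thread --ioengine=psync

That keeps all of the file system's blocks in the first 10GB of the
array, which is roughly the same region your raw-device run was
hitting.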
- Ted
On Thu, Mar 01, 2012 at 01:31:58PM +0800, Xupeng Yun wrote:
> I just set up a new server (Gentoo 64bit with kernel 3.0.17) with 4 x
> 15000RPM SAS disks (sdc, sdd, sde and sdf), and created a software
> RAID 10 on top of them; the partitions are aligned at 1MB:
>
> # fdisk -lu /dev/sd{c,e,d,f}
>
> Disk /dev/sdc: 600.1 GB, 600127266816 bytes
> 255 heads, 63 sectors/track, 72961 cylinders, total 1172123568 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0xdd96eace
>
> Device Boot Start End Blocks Id System
> /dev/sdc1 2048 1172123567 586060760 fd Linux raid
> autodetect
>
> Disk /dev/sde: 600.1 GB, 600127266816 bytes
> 3 heads, 63 sectors/track, 6201712 cylinders, total 1172123568 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0xf869ba1c
>
> Device Boot Start End Blocks Id System
> /dev/sde1 2048 1172123567 586060760 fd Linux raid
> autodetect
>
> Disk /dev/sdd: 600.1 GB, 600127266816 bytes
> 81 heads, 63 sectors/track, 229693 cylinders, total 1172123568 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0xf869ba1c
>
> Device Boot Start End Blocks Id System
> /dev/sdd1 2048 1172123567 586060760 fd Linux raid
> autodetect
>
> Disk /dev/sdf: 600.1 GB, 600127266816 bytes
> 81 heads, 63 sectors/track, 229693 cylinders, total 1172123568 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0xb4893c3c
>
> Device Boot Start End Blocks Id System
> /dev/sdf1 2048 1172123567 586060760 fd Linux raid
> autodetect
>
>
> and here is the RAID 10 (md3) with 64K chunk size:
>
> cat /proc/mdstat
> Personalities : [raid0] [raid1] [raid10]
> md3 : active raid10 sdf1[3] sde1[2] sdd1[1] sdc1[0]
> 1172121344 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md1 : active raid1 sda1[0] sdb1[1]
> 112320 blocks [2/2] [UU]
>
> md2 : active raid1 sda2[0] sdb2[1]
> 41953664 blocks [2/2] [UU]
>
> unused devices: <none>
>
> I did IO testing with `fio` against the raw RAID device (md3), and the
> result looks good (read IOPS 1723 / write IOPS 168):
>
> # fio --filename=/dev/md3 --direct=1 --rw=randrw --bs=16k
> --size=5G --numjobs=16 --runtime=60 --group_reporting --name=file1
> --rwmixread=90 --thread --ioengine=psync
> file1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1
> ...
> file1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1
> fio 2.0.3
> Starting 16 threads
> Jobs: 16 (f=16): [mmmmmmmmmmmmmmmm] [100.0% done] [28234K/2766K
> /s] [1723 /168 iops] [eta 00m:00s]
> file1: (groupid=0, jobs=16): err= 0: pid=17107
> read : io=1606.3MB, bw=27406KB/s, iops=1712 , runt= 60017msec
> clat (usec): min=221 , max=123233 , avg=7693.00, stdev=7734.82
> lat (usec): min=221 , max=123233 , avg=7693.12, stdev=7734.82
> clat percentiles (usec):
> | 1.00th=[ 1128], 5.00th=[ 1560], 10.00th=[ 1928], 20.00th=[ 2640],
> | 30.00th=[ 3376], 40.00th=[ 4128], 50.00th=[ 4896], 60.00th=[ 6304],
> | 70.00th=[ 8256], 80.00th=[11200], 90.00th=[16768], 95.00th=[23168],
> | 99.00th=[38656], 99.50th=[45824], 99.90th=[62720]
> bw (KB/s) : min= 888, max=13093, per=7.59%, avg=2079.11, stdev=922.54
> write: io=183840KB, bw=3063.2KB/s, iops=191 , runt= 60017msec
> clat (msec): min=1 , max=153 , avg=14.70, stdev=14.59
> lat (msec): min=1 , max=153 , avg=14.70, stdev=14.59
> clat percentiles (usec):
> | 1.00th=[ 1816], 5.00th=[ 2544], 10.00th=[ 3248], 20.00th=[ 4512],
> | 30.00th=[ 5728], 40.00th=[ 7648], 50.00th=[ 9536], 60.00th=[12480],
> | 70.00th=[16320], 80.00th=[22144], 90.00th=[32640], 95.00th=[43264],
> | 99.00th=[71168], 99.50th=[82432], 99.90th=[111104]
> bw (KB/s) : min= 90, max= 5806, per=33.81%, avg=1035.45, stdev=973.10
> lat (usec) : 250=0.05%, 500=0.09%, 750=0.05%, 1000=0.19%
> lat (msec) : 2=9.61%, 4=26.05%, 10=38.46%, 20=16.82%, 50=8.02%
> lat (msec) : 100=0.63%, 250=0.03%
> cpu : usr=1.02%, sys=2.87%, ctx=1926728, majf=0, minf=288891
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
> 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%, >=64=0.0%
> issued : total=r=102801/w=11490/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=1606.3MB, aggrb=27405KB/s, minb=28063KB/s,
> maxb=28063KB/s, mint=60017msec, maxt=60017msec
> WRITE: io=183840KB, aggrb=3063KB/s, minb=3136KB/s,
> maxb=3136KB/s, mint=60017msec, maxt=60017msec
>
> Disk stats (read/write):
> md3: ios=102753/11469, merge=0/0, ticks=0/0, in_queue=0,
> util=0.00%, aggrios=25764/5746, aggrmerge=0/0, aggrticks=197378/51351,
> aggrin_queue=248718, aggrutil=99.31%
> sdc: ios=26256/5723, merge=0/0, ticks=204328/68364,
> in_queue=272668, util=99.20%
> sdd: ios=25290/5723, merge=0/0, ticks=187572/61628,
> in_queue=249188, util=98.73%
> sde: ios=25689/5769, merge=0/0, ticks=197340/71828,
> in_queue=269172, util=99.31%
> sdf: ios=25822/5769, merge=0/0, ticks=200272/3584,
> in_queue=203844, util=97.87%
>
> then I created an ext4 file system on top of the RAID device and
> mounted it at /mnt/test:
>
> mkfs.ext4 -E stride=16,stripe-width=32 /dev/md3
> mount /dev/md3 /mnt/test -o noatime,nodiratime,data=writeback,nobarrier
>
> after that I ran the very same IO test, but the result looks very
> bad (read IOPS 926 / write IOPS 97):
>
> # fio --filename=/mnt/test/test --direct=1 --rw=randrw --bs=16k
> --size=5G --numjobs=16 --runtime=60 --group_reporting --name=file1
> --rwmixread=90 --thread --ioengine=psync
> file1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1
> ...
> file1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=psync, iodepth=1
> fio 2.0.3
> Starting 16 threads
> file1: Laying out IO file(s) (1 file(s) / 5120MB)
> Jobs: 16 (f=16): [mmmmmmmmmmmmmmmm] [100.0% done] [15172K/1604K
> /s] [926 /97 iops] [eta 00m:00s]
> file1: (groupid=0, jobs=16): err= 0: pid=18764
> read : io=838816KB, bw=13974KB/s, iops=873 , runt= 60025msec
> clat (usec): min=228 , max=111583 , avg=16412.46, stdev=11632.03
> lat (usec): min=228 , max=111583 , avg=16412.60, stdev=11632.03
> clat percentiles (usec):
> | 1.00th=[ 1384], 5.00th=[ 2320], 10.00th=[ 3376], 20.00th=[ 5216],
> | 30.00th=[ 8256], 40.00th=[11456], 50.00th=[14656], 60.00th=[17792],
> | 70.00th=[21376], 80.00th=[25472], 90.00th=[32128], 95.00th=[37632],
> | 99.00th=[50944], 99.50th=[56576], 99.90th=[70144]
> bw (KB/s) : min= 308, max= 4448, per=6.90%, avg=964.30, stdev=339.53
> write: io=94208KB, bw=1569.5KB/s, iops=98 , runt= 60025msec
> clat (msec): min=1 , max=89 , avg=16.91, stdev=10.24
> lat (msec): min=1 , max=89 , avg=16.92, stdev=10.24
> clat percentiles (usec):
> | 1.00th=[ 2384], 5.00th=[ 3888], 10.00th=[ 5088], 20.00th=[ 7776],
> | 30.00th=[10304], 40.00th=[12736], 50.00th=[15296], 60.00th=[17792],
> | 70.00th=[20864], 80.00th=[24960], 90.00th=[30848], 95.00th=[35584],
> | 99.00th=[47360], 99.50th=[51456], 99.90th=[62208]
> bw (KB/s) : min= 31, max= 4676, per=62.37%, avg=978.64, stdev=896.53
> lat (usec) : 250=0.01%, 500=0.03%, 750=0.01%, 1000=0.06%
> lat (msec) : 2=3.15%, 4=9.42%, 10=22.23%, 20=31.61%, 50=32.39%
> lat (msec) : 100=1.08%, 250=0.01%
> cpu : usr=0.59%, sys=2.63%, ctx=1700318, majf=0, minf=19888
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
> 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> 64=0.0%, >=64=0.0%
> issued : total=r=52426/w=5888/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=838816KB, aggrb=13974KB/s, minb=14309KB/s,
> maxb=14309KB/s, mint=60025msec, maxt=60025msec
> WRITE: io=94208KB, aggrb=1569KB/s, minb=1607KB/s, maxb=1607KB/s,
> mint=60025msec, maxt=60025msec
>
> Disk stats (read/write):
> md3: ios=58848/13987, merge=0/0, ticks=0/0, in_queue=0,
> util=0.00%, aggrios=14750/4159, aggrmerge=0/2861,
> aggrticks=112418/28260, aggrin_queue=140664, aggrutil=84.95%
> sdc: ios=17688/4221, merge=0/2878, ticks=148664/37972,
> in_queue=186628, util=84.95%
> sdd: ios=11801/4219, merge=0/2880, ticks=79396/29192,
> in_queue=108572, util=70.71%
> sde: ios=16427/4099, merge=0/2843, ticks=129072/35252,
> in_queue=164304, util=81.57%
> sdf: ios=13086/4097, merge=0/2845, ticks=92540/10624,
> in_queue=103152, util=60.02%
>
> is anything going wrong here?
>
>
> --
> Xupeng Yun
> http://about.me/xupeng