* small logs that fit in RAID controller cache are (sometimes) faster
@ 2008-10-22 7:40 Peter Cordes
From: Peter Cordes @ 2008-10-22 7:40 UTC
To: xfs
I've been playing with external vs. internal logs, and log sizes, on
a HW RAID controller w/256MB of battery-backed cache. I'm getting the
impression that for a single-threaded untar workload at least, a small
log (5 or 20MB) is actually faster than a 128MB one. I think when the log is
small enough relative to the RAID controller's write-back cache, the
writes don't have to wait for actual disk. It's (maybe) like having
the log on a ramdisk. Battery-backed write caches on RAID controllers
make nobarrier safe (right?), so that's what I was using.
With a 20MB external log (on an otherwise-idle RAID1, while the
filesystem is on a RAID6), during a small-file portion of the tar
extraction, I saw log write speeds of over 100MB/s (averaged over 10s
with dstat, which is like iostat), with 1-second peaks of 115MB/s. The
disk's max sequential write is more like 80MB/s, which is the peak I
was seeing with a 128MB log.
The only workload I'm testing with varying log sizes is untarring
/home onto a fresh filesystem, so single-threaded creation of 121744
files totalling 12.1GB. There were a few very large directories
(10000 files) of mostly small files (25% < 512B, 50% < 1.5k, 95% <
32k, 99.5% < 100k)[1]. There are also some large files on the FS
(maybe an ISO image or something). Watching disk throughput during
the untar, there is a log-bound portion and a large-file portion that
hardly has any log I/O. (The tar was compressed with lzop to 6.9GB
and read over gigE NFS. I primed the cache so there is no NFS traffic
for about the first half of the test, and it's never read-limited:
max NFS read speed is > 100MB/s, and I never saw more than 50MB/s
during the untar. The NFS server is one of the 16GB-RAM compute nodes,
so it easily caches the whole file for when the writes evict some of
the tar.lzo from the cache on the master node.) Highest user-time CPU
usage is about half of a single CPU, system time never hit a full CPU
either, so those weren't limiting factors.
untar times:
110s w/ a 128MB internal log (note that the max sequential write is
at least ~3x higher for the RAID6 the internal log is on)
116s w/ a 128MB external log
113s with a 96MB external log
110s with a 64MB external log
115s with a 20MB external log
110s with a 5MB external log
(although IIRC I did see 105s with a 5 or 20MB external log. Sorry,
wasn't recording all the timings I did, since I wasn't planning to
post... Also, sometimes I was seeing 130s times with the 128MB
external log, but maybe that was before I was so consistent with my
method for priming the cache, so it was being read-limited... Now I
cat twice to /dev/null, then dd bs=1024k count=3200 to /dev/null to
make sure the first half of the file really wants to stay in the cache.)
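In case it helps anyone reproduce this, here's the priming sequence as a runnable sketch. The real file is the 6.9GB tar.lzo on the NFS mount; the sketch substitutes a small scratch file (and scales the dd count down) so it can run anywhere:

```shell
# Stand-in for the real 6.9GB tar.lzo: a small scratch file so the
# sketch is runnable anywhere (sizes scaled down accordingly).
TARBALL=$(mktemp)
dd if=/dev/zero of="$TARBALL" bs=1024k count=8 2>/dev/null

# The actual method: cat the file to /dev/null twice so its pages
# are considered warm by the page cache...
cat "$TARBALL" > /dev/null
cat "$TARBALL" > /dev/null

# ...then re-read just the first half (on the real file that's
# bs=1024k count=3200, i.e. ~3.2GB) so that the portion the untar
# needs first really wants to stay resident.
dd if="$TARBALL" of=/dev/null bs=1024k count=4

rm -f "$TARBALL"
```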
I'm thinking of creating the FS with a 96MB external log, but I
haven't tested with parallel workloads with logs < 128MB. This RAID
array also holds /home (on the first 650GB of array), while this FS
(/data) is the last 2.2TB. /home uses an internal log, so if
someone's keeping /home busy and someone else is keeping /data busy,
their logs won't both be going to the RAID6. (I also use agcount=8 on
/data, agcount=7 on /home. Although I/O will probably be bursty, and
not loading /home and /data at the same time, my main concern is to
avoid scattering files in too many places. fewer AGs = more locality
for a bunch of sequentially created files, right? I have 8 spindles
in the RAID, so that seemed like a good amount of AGs. RAID6 doesn't
like small scattered writes _at_ _all_.)
Basically, I really want to ask if my mkfs parameters look good
before I let users start filling up the filesystems and running jobs
on the cluster, making it hard to redo the mkfs. Am I doing anything
that looks silly?
mkfs: (from xfsprogs 2.10.1 patches with the sw+agcount usability fix)
mkfs.xfs -i attr=2 -d su=64k,sw=6,agcount=8 -n size=16k /dev/sdb2 -L data \
-l lazy-count=1,logdev=/dev/raid1/data-xfslog,size=$((1024*1024*96))
meta-data=/dev/sdb2              isize=256    agcount=8, agsize=70172688 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=561381403, imaxpct=5
         =                       sunit=16     swidth=96 blks
naming   =version 2              bsize=16384  ascii-ci=0
log      =/dev/raid1/data-xfslog bsize=4096   blocks=24576, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mount -o noatime,usrquota,nobarrier,inode64,largeio,swalloc,logbsize=256k,logdev=/dev/raid1/data-xfslog
noikeep is the default, since I don't use dmapi, otherwise I would
include it, too. And yes, the XFS log is on LVM, near the start of
the disk.
BTW, is there a way to mount the root filesystem with inode64? If I
put that in my fstab, it only tries to do it with a remount (which
doesn't work), because I guess Ubuntu's initrd doesn't have mount
options from /etc/fstab. I made my root fs 10GB, and with imaxpct=8,
so the 32bit inode allocator should do ok.
I found -n size=16k helps with random deletion speed, IIRC. I only
use that on the big RAID6 filesystems, not / or /var/tmp, since users
won't be making their huge data file directories anywhere else.
Bonnie++ ends up with highly fragmented directories (xfs_bmap) after
random-order file creation.
/dev/sdb1 and /dev/sdb2 both have their starts full-stripe aligned
(I used parted, and set unit s (sectors), since my stripe width is
768 sectors. I used a GPT disklabel, since DOS partition tables wrap at 2TB.)
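That 768-sector figure falls straight out of the mkfs geometry (su=64k, sw=6); a quick shell check:

```shell
# Stripe geometry from the mkfs line: su=64k stripe unit,
# sw=6 data disks (the 8-disk RAID6 minus 2 parity disks).
SU_KB=64
SW=6

STRIPE_KB=$((SU_KB * SW))         # full stripe width in KiB
STRIPE_SECTORS=$((STRIPE_KB * 2)) # 2 x 512-byte sectors per KiB

# This also matches the mkfs output: sunit=16, swidth=96 in 4KiB
# blocks, i.e. 64KiB and 384KiB.
echo "full stripe = ${STRIPE_KB}KiB = ${STRIPE_SECTORS} sectors"
# prints: full stripe = 384KiB = 768 sectors
```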
The hardware is a Dell PERC 6/E on an 8-disk 500GB (Seagate) SATA
7200RPM RAID6 (PERC 6/E = LSI MegaRAID SAS 1078 w/256MB of battery
backed cache. It talks to an External box of hard drives (MD1000)
over a 4-lane (IIRC) SAS cable.) The computer is a Dell PE1950
(equipped with a single quad core 2GHz Harpertown CPU (12MB cache
total), with 8GB of DDR2-667MHz on a 5000X chipset). Linux 2.6.27
(Ubuntu Intrepid server kernel, which uses the deadline scheduler by
default. Otherwise running Ubuntu AMD64 Hardy (current stable), but
it's new hardware that's better supported by a newer kernel.)
The RAID1 (holding the external log) is on a PERC 6/I (similar
hardware, but hooked up to the internal SATA backplane in the server,
instead of the external MD1000.) I'm not sure if it's a separate card
with its own 256MB of cache, or if it shares the 256MB with the RAID6
on the PERC 6/E. (I opened up the case when the machine first arrived
(just sight-seeing, of course :), but I don't think I looked carefully
enough to rule out there being another RAID controller somewhere...)
If it's shared, that would explain why 128MB slows down. Otherwise, I
don't know, since 128MB should still fit easily in the 256MB of cache.
The RAID1 and RAID6 are both set for adaptive readahead in the
controller BIOS, and with a 64kB stripe size on the RAID6. I use
blockdev --setra 512 /dev/sda # the internal RAID1
blockdev --setra 8192 /dev/sdb # the RAID6
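(For anyone translating those numbers: --setra counts 512-byte sectors, so the two settings work out like this:)

```shell
# blockdev --setra takes a count of 512-byte sectors,
# so divide by 2 to get KiB of readahead.
for RA in 512 8192; do
    echo "--setra $RA => $((RA / 2)) KiB readahead"
done
# prints: --setra 512 => 256 KiB readahead
#         --setra 8192 => 4096 KiB readahead
```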
I don't have time to edit my other benchmark numbers into a presentable
format right now (I have lots of bonnie++ single-threaded and parallel
results for various XFS parameters), but if anyone wants them, email
me and I'll post my notes.
[1] This is pretty typical of what phylogenetics (trying to make
evolutionary trees based on DNA sequences of current life) grad
students and profs do on the cluster: generate output file for
combinations of (random subsets of) data sets and parameters.
Plus they compile software and extract tar files of source code, etc.
Probably we're most likely to be I/O bound when reading, since most
number crunching takes a lot of CPU for each output file, but
summarizing the data at the end usually involves reading all the
output files. But I don't want it to be slow when people untar
something, or whatever. The machine is an NFS server (over bonded
gigE) for a 650GB /home and a 2.2TB /data both on the same 8-disk
RAID6. 12 compute nodes will run jobs that write to those FSes
(usually slowly), while interactive users will mostly be local on the
master node.
The large amount of small files people usually generate is why I
didn't go with LUSTRE. It, or PVFS, might still be useful to make a
scratch filesystem out of the 500GB disks in each compute node,
though. :)
Thanks,
--
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter@cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC
* Re: small logs that fit in RAID controller cache are (sometimes) faster
From: Dave Chinner @ 2008-10-22 21:59 UTC
To: Peter Cordes; +Cc: xfs
On Wed, Oct 22, 2008 at 04:40:31AM -0300, Peter Cordes wrote:
> I've been playing with external vs. internal logs, and log sizes, on
> a HW RAID controller w/256MB of battery-backed cache. I'm getting the
> impression that for a single-threaded untar workload at least, a small
> log (5 or 20MB) is actually faster than a 128MB one. I think when the log is
> small enough relative to the RAID controller's write-back cache, the
> writes don't have to wait for actual disk. It's (maybe) like having
> the log on a ramdisk. Battery-backed write caches on RAID controllers
> make nobarrier safe (right?), so that's what I was using.
[snip lots of stuff]
Basically it comes down to the fact that the default (internal 128MB
log) is pretty much the fastest option when you have a battery
backed cache, right?
> untar times:
> 110s w/ a 128MB internal log (note that the max sequential write is
> at least ~3x higher for the RAID6 the internal log is on)
> 116s w/ a 128MB external log
> 113s with a 96MB external log
> 110s with a 64MB external log
> 115s with a 20MB external log
> 110s with a 5MB external log
In general, a small log is going to limit your transaction
parallelism. You are running a single threaded workload, so that's
not a big deal. Note that with a 16k directory block size, the
transaction reservation sizes are in the order of 3MB, which means
that with a 5MB log you'll completely serialise directory
modifications. IOWs, don't do it.
Larger logs are better because they allow many transactions to be
executed in parallel - something that happens a lot on a fileserver.
Also, large logs allow writeback of metadata to be avoided when it
is frequently modified and relogged. Metadata writeback is
typically a small random write pattern....
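As a very rough back-of-envelope (treating the ~3MB reservation as the only cost, which it isn't - real log space accounting is more involved than this):

```shell
# Approximate the number of directory transactions that can hold
# reservations at once as log size divided by the ~3MB reservation
# that goes with a 16k directory block size.
RESV_MB=3
for LOG_MB in 5 20 128; do
    echo "${LOG_MB}MB log: ~$((LOG_MB / RESV_MB)) concurrent reservations"
done
# prints: 5MB log: ~1 concurrent reservations
#         20MB log: ~6 concurrent reservations
#         128MB log: ~42 concurrent reservations
```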
> I'm thinking of creating the FS with a 96MB external log, but I
> haven't tested with parallel workloads with logs < 128MB. This RAID
> array also holds /home (on the first 650GB of array), while this FS
> (/data) is the last 2.2TB. /home uses an internal log, so if
> someone's keeping /home busy and someone else is keeping /data busy,
> their logs won't both be going to the RAID6.
That's pretty much irrelevant if you have a couple hundred MB of
write cache - the write cache will buffer each sequential log I/O
stream until they span full RAID stripes and then it will sync them
to disk as efficiently as possible. That's why the 128MB internal
log performs so well......
> (I also use agcount=8 on
> /data, agcount=7 on /home. Although I/O will probably be bursty, and
> not loading /home and /data at the same time, my main concern is to
> avoid scattering files in too many places. fewer AGs = more locality
> for a bunch of sequentially created files, right? I have 8 spindles
> in the RAID, so that seemed like a good amount of AGs. RAID6 doesn't
> like small scattered writes _at_ _all_.)
AGs are for allocation parallelism. That is, only one allocation or
free can be taking place in an AG at once. If you have lots of
threads banging on the filesystem, you need enough AGs to prevent
modification of the free space trees being the bottleneck. The data
in each AG will be packed as tightly as possible....
BTW, there is no real correlation between spindle count and # of
AGs for stripe based volumes as all AGs span all spindles. If you
have a linear concatenation, then you have the case where the
number of AGs should match or be a multiple of the number of
spindles. This allows AGs to operate completely independently
of each other....
> Basically, I really want to ask if my mkfs parameters look good
> before I let users start filling up the filesystems and running jobs
> on the cluster, making it hard to redo the mkfs. Am I doing anything
> that looks silly?
Oh, it's a cluster that will be banging on the filesystem? That's
a parallel workload. See above comments about log size and agcount
for parallelism. ;)
> BTW, is there a way to mount the root filesystem with inode64? If I
> put that in my fstab, it only tries to do it with a remount (which
> doesn't work), because I guess Ubuntu's initrd doesn't have mount
> options from /etc/fstab. I made my root fs 10GB, and with imaxpct=8,
> so the 32bit inode allocator should do ok.
The inode64 allocator is used on all filesystems smaller than 1TB.
inode32 only takes over when filesystems grow larger than that.
> I found -n size=16k helps with random deletion speed, IIRC. I only
Helps with lots of directory operations - the btrees are wider and
shallower, which means less I/O for lookups and fewer allocations
and deletions of btree blocks, etc.
> use that on the big RAID6 filesystems, not / or /var/tmp, since users
> won't be making their huge data file directories anywhere else.
> Bonnie++ ends up with highly fragmented directories (xfs_bmap) after
> random-order file creation.
And the larger directory block size reduces fragmentation....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com