public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Gim Leong Chin <chingimleong@yahoo.com.sg>
Cc: Eric Sandeen <sandeen@sandeen.net>, xfs@oss.sgi.com
Subject: Re: allocsize mount option
Date: Wed, 13 Jan 2010 21:50:18 +1100	[thread overview]
Message-ID: <20100113105018.GP17483@discord.disaster> (raw)
In-Reply-To: <348821.11898.qm@web76206.mail.sg1.yahoo.com>

On Wed, Jan 13, 2010 at 05:42:16PM +0800, Gim Leong Chin wrote:
> Hi,
> 
> 
> The application is ANSYS, which writes 128 GB files.  The existing
> computer with SUSE Linux Enterprise Desktop 11 which is used for
> running ANSYS, has two software RAID 0 devices made up of five 1
> TB drives.  The /home partition is 4.5 T, and it is now 4 TB
> full.  I see a fragmentation > 19%.

XFS will start to fragment when the filesystem gets beyond 85%
full - it seems that you are very close to that threshold.

That being said, if you've pulled the figure of 19% from the xfs_db
measure of fragmentation, that doesn't mean the filesystem is badly
fragmented; it just means that there are 19% more fragments
than the ideal. In 4TB of data stored as 1GB files, that would mean
roughly 4760 extents (average length ~860MB, which is excellent)
instead of the perfect 4000 extents (one 1GB extent per file). Hence
you can see how misleading this "19% fragmentation" number can be on
an extent based filesystem...
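To put numbers on that, here's the arithmetic as a quick shell
sketch (the device name is just a placeholder, and the figures here
are exact rather than rounded):

```shell
# The fragmentation figure usually comes from (read-only, safe while mounted):
#   xfs_db -r -c frag /dev/sdb1
#
# Relating "19% fragmentation" to extent counts for ~4TB of 1GB files:
ideal=4000                                # one extent per 1GB file
actual=$(( ideal * 119 / 100 ))           # 19% more fragments than ideal
avg_mb=$(( ideal * 1024 / actual ))       # average extent length in MB
echo "extents=$actual avg=${avg_mb}MB"    # extents=4760 avg=860MB
```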

> I have just set up a new computer with 16 WD Caviar Black 1 TB
> drives connected to an Areca 1680ix-16 RAID with 4 GB cache.  14
> of these drives are in RAID 6 with 128 kB stripes.  The OS is also
> SLED 11.  The system has 16 GB memory, and AMD Phenom II X4 965
> CPU.
> 
> I have done tests writing 100 30 MB files and 1 GB, 10 GB and 20
> GB files, with single instance and multiple instances.
> 
> There is a big difference in writing speed when writing 20 GB
> files when using allocsize=1g and not using the option.  That is
> without the inode64 option, which gives further speed gains.

> I use dd for writing the 1 GB, 10 GB and 20 GB files.
> 
> mkfs.xfs -f -b size=4k -d agcount=32,su=128k,sw=12 -i size=256,align=1,attr=2 -l version=2,su=128k,lazy-count=1 -n version=2 -s size=512 -L /data /dev/sdb1
> 
> 
> defaults,nobarrier,usrquota,grpquota,noatime,nodiratime,allocsize=1g,logbufs=8,logbsize=256k,largeio,swalloc
> 
> The start of the partition has been set to LBA 3072 using GPT Fdisk to align the stripes.

This all looks good - it certainly seems that you have done your
research. ;) The only thing I'd do differently: if the drives are
dedicated to a single filesystem, I'd skip the partition table
entirely and make the filesystem on the whole device.
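For what it's worth, once the filesystem is made and mounted you can
confirm the geometry it actually picked up (the mount point here is
assumed from your fstab; the expected values follow from the mkfs
options above):

```shell
# Confirm stripe alignment on the mounted filesystem:
xfs_info /data
# With su=128k,sw=12 expect sunit=32 and swidth=384 in the output
# (units of 4k blocks: 128k stripe unit x 12 data disks = 1536k width).
```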

> The dd command is:
> 
> chingl@tsunami:/data/test/t2> dd if=/dev/zero of=bigfile20GB bs=1073741824 count=20

I'd significantly reduce the size of that buffer - too large a
buffer can slow down IO due to the memory it consumes and TLB misses
it causes. I'd typically use something like:

$ dd if=/dev/zero of=bigfile bs=1024k count=20k

Which does 20,480 writes of 1MiB each (still 20GiB in total) and
ensures the dd process isn't holding a 1GiB buffer in RAM.

> Single instance of 20 GB dd repeats were 214, 221, 123 MB/s with
> allocsize=1g, compared to 94, 126 MB/s without.

This seems rather low for a buffered write on hardware that can
clearly go faster. SLED11 is based on 2.6.27, right? I suspect that
many of the buffered writeback issues that have been fixed since
2.6.30 are present in the SLED11 kernel, and if that is the case I
can see why the allocsize mount option makes such a big
difference.

It might be worth checking how well direct IO writes run, to take
buffered writeback issues out of the equation. In that case, I'd use
a buffer size that is a multiple of the stripe width, like:

$ dd if=/dev/zero of=bigfile bs=3072k count=7k oflag=direct

I'd suggest that you might need to look at increasing the maximum IO
size for the block device (/sys/block/sdb/queue/max_sectors_kb), and
maybe the request queue depth as well, to get larger IOs pushed down
to the RAID controller. If you can, at least get it to the stripe
width of 1536k....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
