Date: Wed, 5 Sep 2018 17:43:49 +1000
From: Dave Chinner
To: Martin Steigerwald
Cc: "Richard W.M. Jones", linux-xfs@vger.kernel.org
Subject: Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
Message-ID: <20180905074349.GX5631@dastard>
References: <20180903224919.GA16358@redhat.com>
 <8045369.1GWcTSHGau@merkaba>
 <20180904222345.GV5631@dastard>
 <4003432.r53nhXgZDq@merkaba>
In-Reply-To: <4003432.r53nhXgZDq@merkaba>
List-Id: xfs

On Wed, Sep 05, 2018 at 09:09:28AM +0200, Martin Steigerwald wrote:
> Dave Chinner - 05.09.18, 00:23:
> > On Tue, Sep 04, 2018 at 05:36:43PM +0200, Martin Steigerwald wrote:
> > > Dave Chinner - 04.09.18, 02:49:
> > > > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > > > [This is silly and has no real purpose except to explore the
> > > > > limits. If that offends you, don't read the rest of this email.]
> > > >
> > > > We do this quite frequently ourselves, even if it is just to
> > > > remind ourselves how long it takes to wait for millions of IOs to
> > > > be done.
> > >
> > > Just for the fun of it, during a Linux Performance analysis & tuning
> > > course I held, I created a 1 EiB XFS filesystem in a sparse file on
> > > another XFS filesystem on an SSD of a ThinkPad T520. It took several
> > > hours to create, but then it was there and mountable. AFAIR the
> > > sparse file was a bit less than 20 GiB.
> >
> > Yup, 20GB of single sector IOs takes a long time.
>
> Yeah. It was interesting to see that neither the CPU nor the SSD was
> fully utilized during that time, though.

Right - it's not CPU bound because it's always waiting on a single IO,
and it's not IO bound because it's only issuing a single IO at a time.

Speaking of which, I just hacked a delayed write buffer list construct
similar to the kernel code into mkfs/libxfs to batch writeback. Then I
added a hacky AIO ring to allow it to drive deep IO queues. I'm seeing
sustained request queue depths of ~100 and the SSDs are about 80% busy
at 100,000 write IOPS. But mkfs is only consuming about 60% of a single
CPU.

Which means that, instead of 7-8 hours to make an 8EB filesystem, we can
get it down to:

$ time sudo ~/packages/mkfs.xfs -K -d size=8191p /dev/vdd
meta-data=/dev/vdd               isize=512    agcount=8387585, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2251524935778304, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    15m18.090s
user    5m54.162s
sys     3m49.518s

Around 15 minutes on a couple of cheap consumer NVMe SSDs.

xfs_repair is going to need some help to scale up to this many AGs,
though - phase 1 is doing a huge amount of IO just to verify the primary
superblock...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
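
For readers who want to experiment with the same idea outside xfsprogs,
here is a minimal sketch of what Dave describes above: dirty buffers are
queued on a "delayed write" list and then flushed through a Linux AIO
ring so that around a hundred writes are in flight at once, instead of
waiting on one synchronous write at a time. This is not the actual
mkfs/libxfs patch; the helper names (delwri_queue()/delwri_submit()),
the queue depth, the buffer size and the 1 MiB spacing are all invented
for illustration, and it assumes libaio (build with -laio) and an
O_DIRECT-capable target.

#define _GNU_SOURCE		/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE	4096
#define QUEUE_DEPTH	128	/* roughly the ~100-deep queue mentioned above */

struct delwri_buf {
	struct iocb	iocb;		/* AIO control block, must stay first */
	void		*data;		/* BUF_SIZE-aligned payload */
	off_t		offset;		/* device offset to write to */
	struct delwri_buf *next;	/* delayed write list linkage */
};

static struct delwri_buf *delwri_list;	/* dirtied but not yet issued */

/* Put a dirty buffer on the delayed write list instead of writing it now. */
static void delwri_queue(struct delwri_buf *bp)
{
	bp->next = delwri_list;
	delwri_list = bp;
}

/* Flush the whole delayed write list through a deep AIO queue. */
static int delwri_submit(io_context_t ctx, int fd)
{
	struct iocb *batch[QUEUE_DEPTH];
	struct io_event events[QUEUE_DEPTH];
	int inflight = 0;

	while (delwri_list || inflight) {
		int n = 0, done;

		/* Top the submission queue up to QUEUE_DEPTH. */
		while (delwri_list && inflight + n < QUEUE_DEPTH) {
			struct delwri_buf *bp = delwri_list;

			delwri_list = bp->next;
			io_prep_pwrite(&bp->iocb, fd, bp->data, BUF_SIZE,
				       bp->offset);
			batch[n++] = &bp->iocb;
		}
		if (n) {
			int ret = io_submit(ctx, n, batch);

			if (ret < 0)
				return ret;
			inflight += ret;   /* partial submits not retried here */
		}

		/* Reap at least one completion so the ring keeps moving. */
		if (inflight) {
			done = io_getevents(ctx, 1, QUEUE_DEPTH, events, NULL);
			if (done < 0)
				return done;
			inflight -= done;  /* completed buffers leak in this sketch */
		}
	}
	return 0;
}

int main(int argc, char **argv)
{
	io_context_t ctx = 0;
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || io_setup(QUEUE_DEPTH, &ctx) < 0) {
		perror("setup");
		return 1;
	}

	/* Dirty some stand-in metadata buffers, spaced 1 MiB apart. */
	for (i = 0; i < 512; i++) {
		struct delwri_buf *bp = calloc(1, sizeof(*bp));

		if (!bp || posix_memalign(&bp->data, BUF_SIZE, BUF_SIZE))
			return 1;
		memset(bp->data, 0, BUF_SIZE);
		bp->offset = (off_t)i * 1024 * 1024;
		delwri_queue(bp);
	}

	if (delwri_submit(ctx, fd) < 0)
		perror("delwri_submit");
	io_destroy(ctx);
	close(fd);
	return 0;
}

The point of the loop in delwri_submit() is that it never waits on an
individual write: it keeps topping up the queue and reaping completions,
which is what lets the device run near its IOPS limit while the CPU
stays mostly idle, matching the ~100-deep queues and 60% of one CPU
reported in the mail.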